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Abstract 


\J 

This  study  examines  the  effects  of  computer 
architecture  on  FFT  algorithm  performance.  The  computer 
architectures  evaluated  are  those  of  the  Cray-1,  CDC  Cyber 
750,  IBM  370/155,  DEC  VAX  11/780,  DEC  POP  11/60,  DEC  PDP 
11/50,  and  Cromemco  Z-2D.  The  algorithms  executed  are  the 
radix-2,  mixed-radix  FFT  (MFFT),  Winograd  Fourier  Transform 
Algorithm  (WFTA),  and  prime  factor  algorithm  (PFA). 

The  execution  time  of  each  algorithm  for  different 

sequence  lengths  is  determined  for  each  computer,  v  The 

--  - 

initialized  UFTA  is  fastest  on  the  Cray-1,  the  radix-2  is 
fastest  on  the  CDC  Cyber  750,  and  the  PFA  is  fastest  on  the 
others.  C.  Then  the  number  of  assembly  language  instructions 
executed  are  determined  for  the  following  categories:  data 

transfers,  floating  point  additions  and  subtractions, 
floating  point  multiplications  and  divisions,  and  integer 
operations.  The  correlation  coefficients  between  the  number 
of  assembly  language  instructions  in  each  category  and  the 
algorithm  execution  speeds  are  determined  for  each  computer. 
The  average  values  of  the  correlation  coefficients  range 
from  0.8614  for  the  floating  multiplications  and  divisions 
to  0.9792  for  the  data  transfers.  '‘The  values  of  the 
correlation  coefficients  are  then  related  to  the  computer 
architectures . 

The  computer  architectures  are  then  compared  against 
each  other  to  determine  what  features  ere  desireable  in  an 
FFT  processor.  The  most  desireable  features  are  assembled 
into  a  proposed  minimum  computer  architecture  for  efficient 


FFT  performance.  The  minimum  architecture  includes  separate 
functional  units,  a  cache  memory,  and  separate  floating 
point  and  integer  registers.  In  addition,  a  method  for 
deciding  how  to  improve  a  given  architecture  is  presented. 
The  method  is  based  on  the  correlation  coefficients  for  that 
architecture.  Guidelines  for  predicting  FFT  algorithm 
performance  are  given  based  on  the  known  computer 
architecture.  Floating  point  processors  execute  the  radix-2 
fastest,  data  transfer  processors  execute  the  PFA  fastest, 
and  vector  processors  execute  the  initialized  WFTA  fastest. 


xii 


I.  Introduction 


Background. 

A  Fourier  transform  is  a  mathematical  method  of 
determining  the  frequency  content  of  a  given  time  domain 
signal  representation.  Fourier  transforms  are  commonly 
found  in  many  signal  processing  applications.  But  Fourier 
transforms,  which  are  continuous  functions,  cannot  be 
calculated  directly  on  a  digital  computer.  To  overcome  this 
problem,  and  take  advantage  of  the  processing  power  of 
digital  computers,  discrete  Fourier  transforms  (DFTs)  were 
developed.  The  equation  defining  the  DFT  is 
N-l 

X(k)«  l  x(n)exp(-j  2irnk/N)  k-0,1 . N-l  ( 1-1 ) 

n  =  0 

where  x(n)  consists  of  N  samples  of  a  finite  length  sequence 
and  X(k)  consists  of  N  frequency  components  of  the 

sequence's  Fourier  transform.  Direct  evaluation  of  (1-1) 

2 

requires  N  complex  operations,  where  a  complex  operation  is 
defined  as  a  complex  addition  and  a  complex  multiplication. 
Computationally  efficient  algorithms  have  been  devised  to 
reduce  the  number  of  operations  required  to  calculate  the 
DFT.  These  DFT  algorithms  have  come  to  be  known  as  Fast 
Fourier  Transform  (FFT)  algorithms.  The  first,  developed  by 
Cooley  and  Tukey  in  1965,  required  only  Nlog2N  complex 
operations  (Cooley  and  Tukey,  1965).  Since  then,  many  new 
FFT  algorithms  have  been  devised. 

The  purpose  of  an  FFT  is  to  calculate  the  Fourier 
Transform  of  an  input  sequence  as  quickly  as  possible.  Uith 
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the  introduction  of  various  computer  architectures,  some 
algorit  ms  have  been  reported  to  perform  better  than  others 
on  particular  computers.  But  no  specific  and  detailed 
comparisons  have  been  made  between  the  major  FFT  algorithms 
and  the  computer  architectures.  The  process  of  optimizing 
an  algorithm  for  a  specific  computer  architecture,  or 
conversely,  optimizing  a  computer  architecture  for  a 
specific  Fourier  Transform  algorithm,  is  important  in  the 
design  of  dedicated  FFT  processors.  Since  FFT  computations 
are  required  in  many  signal  processing  applications,  the 
optimized  design  of  a  dedicated  FFT  processor  will  greatly 
increase  throughput  and  decrease  costs  of  signal  processing 
systems . 

Problem. 

In  the  past,  ail  FFT  algorithm  performance  comparisons 
have  been  made  on  the  basis  of  executing  the  programs  on  one 
particular  computer  architecture.  This  method  gives  a 
ranking  of  the  algorithms  but  does  not  investigate  the 
effects  of  the  computer  architecture  on  the  algorithm 
performance.  This  study  has  three  purposes.  The  first  is 
to  determine  the  performance  rankings,  based  on  execution 
speed,  of  each  of  the  four  most  widely  used  algorithms  as 
they  are  executed  on  different  computer  architectures.  This 
will  allow  FFT  users  to  select  the  fastest  algorithm  given 
the  computer  architecture  available.  The  second  is  to 
determine  the  relationship  between  the  computer  architecture 
and  the  algorithm  performance.  This  will  provide 
information  to  those  trying  to  optimize  the  implementation 


of  the  algorithm  or  the  architecture  for  maximum 
performance.  The  final  purpose  is  to  determine  the  minimum 
architecture  required  for  efficient  execution  of  FFT 
algorithms.  This  will  provide  a  guideline  for  those 
designing  or  acquiring  computer  systems  for  the  purpose  of 
performing  FFTs . 

Scope . 

This  study  examines  the  execution  of  four  algorithms  on 
seven  different  computers.  The  algorithms  are  a  radix-2 
algorithm,  the  Singleton  mixed-radix  algorithm,  the  Uinograd 
Fourier  Transform  algorithm,  and  the  Burrus  prime  factor 
algorithm.  The  seven  computers  are  a  Cray-1,  a  CDC  Cyber 
750,  an  IBM  370/155,  a  DEC  VAX  11/780,  a  DEC  PDP  11/60,  a 
DEC  PDP  11/50,  and  a  Cromemco  Z-2D.  The  algorithms  are 
evaluated  for  a  selected  set  of  sequence  lengths:  512, 
1024,  and  2048  for  the  radix-2;  504,  630,  1008,  1260,  and 
2520  for  all  other  algorithms.  These  sequence  lengths 
provide  representative  samples,  and  the  information  found 
can  be  extrapolated  to  other  sequence  lengths.  The  sequence 
lengths  were  chosen  so  that  those  for  the  radix-2  are  as 
close  as  possible  to  the  other  algorithms,  thus  allowing  a 
direct  comparison  between  all  the  algorithms.  This  study 
presents  the  execution  time  of  each  of  the  algorithms  on 
each  of  the  computers.  In  addition,  a  count  of  the  various 
assembly  language  instructions  was  made  for  each  of  t».e 
algorithms.  This  study  does  not  try  to  determine  which 
algorithm  is  the  best,  or  which  computer  architecture  is  the 
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best,  but  determines  which  algorithms  match  which 
architectures  and  also  determines  which  hardware  features 
are  needed  for  fast  execution  of  any  FFT  algorithm. 
Assumptions . 

This  study  assumes  that  the  CPU  clocks  are  as  accurate 
as  published  in  the  manufacturer's  manuals.  Also,  the 
impact  of  the  operating  system  overhead  is  negligible 
because  the  implementation  of  the  algorithm  is  a  CPU 
intensive  job  and  each  implementation  was  executed  either  as 
the  only  job  in  the  system  or  at  a  time  of  minimal  activity. 
In  addition,  this  study  assumes  that  the  manufacturer's 
published  instruction  timings  are  accurate. 

Methodology. 

The  four  major  FFT  algorithms  are  compared  to  each 
other  on  different  computer  architectures.  The  execution 
speed,  memory  requirements,  and  instruction  counts  for  the 
four  algorithms  are  compared.  The  algorithms  studied  are  a 
decimation-in-time  radix  two  algorithm,  the  Singleton  mixed 
radix  algorit  m  (MFFT),  the  Winograd  Fourier  Transform 
algorithm  (WFTA),  and  the  Burrus  prime  factor  algorithm 
(PFA).  Each  algorithm  is  studied  for  different  sequence 
lengths  on  seven  different  computers:  a  Cray-1,  a  CDC  Cyber 
750,  an  IBM  370/155,  a  DEC  VAX  11/780,  a  DEC  PDP  11/60,  a 
DEC  PDP  11/50,  and  a  Cromemco  Z-2D.  The  Cray-1  and  CDC 
Cyber  750  were  chosen  as  examples  of  scientific  mainframe 
computers.  The  IBM  370/155  was  chosen  as  an  example  of  a 
general  purpose  mainframe  computer.  The  DEC  VAX  11/780  was 
chosen  as  an  example  of  a  new  generation  super  minicomputer. 
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The  DEC  PDP  11/60  and  PDP  11/50  were  chosen  as  examples  of 
minicomputers.  The  Cromemco  Z-2D  is  an  example  of  a 
microcomputer.  Comparisons  are  made  among  the  results  and  a 
determination  made  as  to  what  hardware  features  are 
important  for  which  algorithms. 

The  execution  speed  of  each  of  the  algorithms  was 
determined  by  using  the  FORTRAN  library  subroutine  calls  to 
the  real-time  clock.  The  clock  was  called  at  the  beginning 
of  the  algorithm  and  initialized  to  zero.  At  the  end  of  the 
algorithm  the  clock  was  called  again  and  the  elapsed  CPU 
time  since  the  start  of  the  algorithm  determined.  For  the 
Cromemco  Z-2D,  which  does  not  have  a  complete  FORTRAN 
subroutine  library,  the  execution  time  was  determined  by 
manually  timing  the  algorithm  with  a  stopwatch.  For  the  DEC 
VAX  11/780,  which  also  does  not  have  a  complete  FORTRAN 
subroutine  library,  the  execution  time  was  determined  by 
writing  a  subroutine  in  the  C  programming  language  which 
accessed  the  real  time  clock  and  could  be  called  from  the 
FOR  RAN  program.  These  times  are  compared  and  ranked  for 
the  different  algorithms  on  each  computer. 

The  instruction  counts  for  each  computer  were 
determined  with  the  help  of  the  FORTRAN  compilers.  All  of 
the  FORTRAN  compilers  had  an  option  for  generating  an 
assembly  language  listing.  Once  this  listing  was  obtained, 
the  flow  of  execution  was  determined  and  the  number  of  times 
an  instruction  was  executed  was  counted  by  hand.  Once  the 
number  of  different  types  of  instructions  was  known,  the 
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execution  speed  for  other  sequence  lengths  can  be  predicted, 
if  the  architecture  and  instruction  cycle  times  are  known. 
Presentation . 

Chapter  2  presents  a  review  of  the  current  literature. 
This  literature  review  summarizes  the  current  research 

efforts  in  the  areas  of  FFT  algorithm  development,  FFT 

* 

processor  architecture,  and  the  relationship  between  the 
two.  Chapter  3  analyzes  each  of  the  four  algorithms  in 
detail.  The  radix-2,  MFFT,  WF.A,  and  PFA  are  introduced  and 
the  theory  of  operation  explained.  The  development  of  the 
number  of  times  each  section  of  the  algorithm  is  executed  is 
presented.  Chapter  4  describes  the  computer  architecture  of 
each  of  the  seven  computers  used.  The  hardware  features  are 
described,  as  well  as  the  operating  system  and  compiler  used 
to  execute  the  FFT  algorithms.  Chapter  5  presents  the 
results  of  executing  each  of  the  four  algorithms  on  each  of 
the  seven  computers.  The  execution  speeds  for  each  of  the 
algorithms  and  sequence  lengths  is  presented,  as  well  as  the 
instruction  counts.  Correlation  coefficients  are  determined 
between  the  execution  speeds  and  the  instruction  counts. 
Chapter  6  compares  the  results  of  each  computer  against  the 
other  in  order  to  determine  the  important  architectural 
features.  The  differences  in  architecture  are  related  to 
the  differences  in  performance  of  the  algorithms.  Chapter  7 
takes  the  important  architectural  features  and  combines  them 
into  the  minimum  architecture  necessary  for  efficient  FFT 
execution.  This  architecture  could  be  used  as  a  guideline 
in  the  design  of  dedicated  FFT  processors.  Chapter  8 
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presents  some  guidelines  for  predicting  which  algorithm  will 
be  the  fastest  based  on  the  computer  architecture.  Chapter 
9  presents  the  conclusions  drawn  and  recommendations  for 
further  study. 
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II.  Literature  Review 


Fourier  transforms  have  an  important  role  in  electrical 
engineering.  With  a  Fourier  transform,  information  in  the 
time  domain  can  be  converted  to  the  frequency  domain.  Often 
the  frequency  domain  representation  is  more  useful  and 
easier  to  analyze.  But  calculating  the  Fourier  Transform  of 
an  arbitrary  time  domain  signal  has  always  been  a  lengthy 
and  difficult  process.  The  appearance  of  digital  computers, 
with  their  speed  and  ease  of  programming,  has  led  to  the 
development  of  many  algorithms  for  computing  a  discrete 
Fourier  transform  (DFT).  However,  if  signal  processing 
requirements  increase  faster  than  hardware  and  algorithms 
are  improved,  many  of  these  algorithms  may  not  be  usable  due 
to  their  slow  speed  or  large  memory  requirements. 
Therefore,  research  is  being  conducted  on  evaluating  the 
currently  available  algorithms  as  well  as  developing  new  and 
better  algorithms 
Algorithms 

The  introduction  of  digital  computers  initiated  the 
search  for  DFT  algorithms.  The  first  algorithm  developed 
was  based  on  a  sequence  length  which  was  a  power  of  two 
(Cooley  and  Tukey,  1965).  This  algorithm  reduced  the  number 

p 

of  complex  operations  from  N  to  Nlog2N.  However,  by  taking 
advantage  of  the  symmetry  of  exp(-J2irk/M  ),  the  number  of 
complex  multiplications  can  be  reduced  to  (N/2)log2N,  while 
the  number  of  complex  additions  stays  the  same.  Since  then, 
other  algorithms  have  been  developed  in  which  the  sequence 


length  is  not  restricted  to  being  a  power  of  two.  The  newer 
algorithms  are  more  flexible,  but  may  be  slower. 

Singleton  developed  an  algorithm  which  uses  mixed 
radixes  (Singleton,  1969).  Singleton  used  an  algorithm  by 
Gentleman  and  Sande  to  develop  a  mixed  radix  algorithm  which 
uses  the  twiddle  factors  in  order  (Gentleman  and  Sande, 

1966).  In  addition,  the  number  of  complex  multiplications 

2  2 

is  reduced  from  (p-1)  to  (p-l)  /4  for  odd  p,  where  p 
is  a  factor  of  the  sequence  length  N.  A  final  permutation 
is  used  to  arrange  the  results  in  order. 

Kolba  and  Parks  developed  a  prime  factor  algorithm 
which  is  better  than  Singleton's  if  the  sequence  length  can 
be  factored  in  a  particular  way  (Kolba  and  Parks,  1977). 
This  algorithm  is  based  on  the  idea  of  circular  convolution, 
and  uses  the  Chinese  Remainder  Theorem  to  map  the  one 
dimensional  sequence  into  multiple  dimensions.  This 
algorithm  was  found  to  be  faster  than  Singleton's  mixed 
radix  algorithm. 

Most  recently,  Winograd  developed  and  algorithm  using  a 
series  of  nested  multiplications  (Silverman,  1977). 
Uinograd  developed  short  DFT  algorithms  based  on  more 
efficient  convolution  algorithms  using  a  reduced  number  of 
multiplications.  This  algorithm  reduces  the  number  of 
multiplications  while  not  increasing  the  number  of 
additions. 

Research  has  been  done  in  comparing  these  algorithms  in 
the  areas  of  number  of  multiplications  and  additions  and  the 
array  storage  requirements  (Blanken  and  Rustan,  1982). 
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Blanken  and  Rustan  found  that  the  MFFT  is  the  roost  flexible 
algorithm  and  uses  less  memory  than  either  WFTA  or  PFA. 
Expressions  for  the  array  storage  requirements  for  each  of 
the  three  algorithms  were  determined.  WFTA  and  PFA  require 
fewer  real  operations  than  MFFT.  Blanken  and  Rustan  state 
that  selection  of  the  most  efficient  algorithm  in  terms  of 
speed  depends  on  the  machine  speed  of  real  additions  versus 
real  multiplications.  However,  they  did  not  determine  what 
the  dependence  was. 

Architecture 

Most  of  the  studies  which  deal  with  execution  times 
have  only  looked  at  two  of  the  algorithms  at  a  time 
(Morris,  1978).  Morris  examined  the  WFTA  and  a  radix-4 
algorithm  on  both  a  DEC  PDP  11/55  and  an  IBM  370/168.  He 
found  that  the  radix-4  was  better  than  WFTA  both  in  terms  of 
speed  and  memory  requirements.  Morris  does  relate  the 
algorithm  performance  to  the  computer  architecture;  however, 
since  the  architectures  are  similar,  the  effects  of  the 
architecture  cannot  readily  be  determined.  He  also  begins 
to  explore  the  effects  of  the  FORTRAN  compiler  on  the 
algorithm  performance,  and  states  that  algorithms  cannot 
always  be  judged  strictly  in  terms  of  the  number  of 
arithmetic  operations. 

Lately  though,  more  people  have  become  interested  in 
comparing  these  algorithms  to  each  other  on  a  single 
computer.  Burrus  and  Eschenbacher  compared  all  four  major 
algorithms  in  terms  of  speed  and  memory  requirements  on  a 


II-3 


DEC  PuP  11/45  minicomputer  (Burrus  and  Eschenbacher ,  1981). 
They  found  the  PFA  to  be  faster  than  the  MFFT,  radix-4,  or 
Winograd  algorithms.  Also,  since  the  PFA  uses  short  DFT 
modules,  unused  modules  can  be  removed  when  a  fixed  sequence 
length  is  to  be  calculated,  or  additional  modules  can  be 
added  to  increase  the  number  of  sequence  lengths  that  can  be 
transformed.  However,  only  one  computer  was  used.  Thus, 
the  effects  of  the  computer  architecture  on  the  algorithm 
performance  could  not  be  determined.  They  also  determined 
the  number  of  real  operations  for  each  algorithm. 

The  recent  studies  also  began  to  relate  the  structure 
of  the  algorithm  to  the  hardware  architecture  of  various 
common  computers.  Route  began  by  comparing  four  algorithms 
on  four  different  medium-  to  large-sized  computers  (Route, 
1981).  He  then  tried  to  relate  the  speed  and  memory 
requirements  of  each  algorithm  to  the  architecture  of  each 
computer.  Route  was  able  to  rank  the  algorithms  against 
each  other  and  also  between  the  different  computers,  using 
speed  and  memory  requirements  as  the  basic  criteria.  He 
found  that  the  choice  of  algorithm  is  important  in 
maximizing  computer  efficiency.  Increasing  efficiency  could 
decrease  the  cost  and  effort  required  in  performing  a 
particular  signal  processing  task.  Route  found  that  the 
ranking  of  algorithms  by  execution  speed  could  be  different 
on  different  computers.  He  found  that  UFTA  was  fastest  on  a 
Cray-1,  radix-2  was  fastest  on  a  CDC  Cyber  750,  and  PFA  was 
fastest  on  an  IBM  370/155  and  DEC  PDP  11/60.  He  then 
related  the  algorithm  performance  to  the  computer 
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architecture  by  determining  the  instruction  cycle  times  and 
instruction  counts  for  each  algorithm  and  computer.  Route 
emphasized  the  effect  of  data  transfers  on  the  algorithm 
performance . 

Research  is  also  being  conducted  on  developing  faster 
algorithms.  Most  of  these  algorithms  are  being  tailored  to 
the  computer  architecture  on  which  they  will  run.  However, 
once  an  algorithm  has  been  developed  for  a  particular 
architecture,  it  cannot  be  transferred  easily  to  another 
computer.  Chu  and  Burrus  have  developed  an  algorithm  which 
can  be  executed  on  a  distributed  microprocessor  system  (Chu 
and  Burrus,  1982).  The  algorithm  is  based  on  the  PFA,  and 
has  been  modified  to  take  advantage  of  distributed 
arithmetic.  The  calculation  of  the  DFT  is  converted  to  a 
convolution.  The  algorithm  was  programmed  on  a  Z-80 
microprocessor  and  was  10-20  times  faster  than  a  radix-2 
algorithm  and  2-5  times  faster  than  a  standard  PFA 
algorithm . 

Also,  traditional  algorithms,  such  as  the  radix-2 
algorithm,  are  being  studied  and  modified  to  try  to  improve 
their  performance  (Preuss,  1982).  This  type  of  improvement 
depends  more  on  mathematical  advances  than  on  technological 
advances.  Recent  advances  in  software  are  also  being  used 
to  improve  performance  (Johnson  and  Burrus,  1982).  These 
advances  use  the  latest  software  design  techniques  and 
higher  order  language  capabilities  to  optimize  the  program. 

Computer  architecture  will  continue  to  change. 
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Likewise,  the  demands  placed  on  their  signal  processing 
capabilities  will  continue  to  increase.  With  the 
availability  of  microcomputers,  the  current  trend  is  towards 
using  a  microcomputer  dedicated  to  calculating  a  discrete 
Fourier  Transform.  In  order  to  increase  the  speed  and 
throughput,  and  to  decrease  the  cost  and  memory  requirement, 
the  algorithm  must  match  the  architecture.  Usually,  the 
matching  of  algorithm  to  architecture  is  difficult  to 
predict  theoretically.  Thus  the  major  source  of  information 
is  the  experimental  results  obtained  by  executing  the 
algorithms  on  various  computers. 
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III.  FFT  Algorithms 


This  chapter  presents  an  analysis  of  each  of  the  four 
FFT  algorithms  used  in  this  study. 

Background 

The  basic  equation  for  the  evaluation  of  a  discrete 
Fourier  transform  is 

N-l 

X(k)» Z x(n)exp(-j2^kn/N)  (3-1) 

n*0 

where  n  is  the  time  domain  index,  x(n)  is  the  time  domain 
sample  corresponding  to  n,  k  is  the  frequency  domain  index, 
X(k)  is  the  frequency  domain  sample  corresponding  to  k,  and 
N  is  the  number  of  time  domain  samples  which  is  also  the 
number  of  frequency  domain  samples  (Oppenheim  and  Schafer, 
1975).  The  equation  is  usually  simplified  by  letting 

WN»exp(-j2TT/N)  (3-2) 

thus  giving 

N-i  k_ 

X(k)-  Ex(n)W*n  (3-3) 

n-0 

For  a  particular  value  of  k,  x(n)  must  be  multiplied  by 
a  total  of  N  times.  If  x(n)  is  complex,  i.e.,  consists 
of  a  real  and  imaginary  part,  then  each  multiplication  of 
x(n)  by  requires  four  real  multiplications.  Thus,  the 

evaluation  of  X(k)  for  a  particular  value  of  k  requires  4N 
real  multiplications.  Since  there  are  N  different  X(k) 

terms  to  be  evaluated,  the  total  number  of  real 

2 

multiplications  which  must  be  performed  is  4N  .  Likewise, 


the  number  of  real  additions  to  be  performed  is  N(4N-2). 

Most  approaches  to  improving  the  efficiency  of  a  DFT 
algorithm  exploit  either  the  symmetry  or  the  periodicity,  or 
both,  of  the  exponential  term.  The  symmetry  is  shown  by 


k(S-n)  kn,* 


(3-3) 


* 

where  (  )  means  the  complex  conjugate  of  the  quantity 

enclosed.  The  periodicity  is  shown  by  the  identity 


u  kn=u  (k+N)n 

WM  w  \j  -w:i 


(3-4) 


Cooley  and  Tukey  were  the  first  to  develop  an  algorithm 
which  greatly  reduced  the  number  of  operations  required  to 
calculate  a  DFT  (Cooley  and  Tukey,  1965).  The  fundamental 
principle  used  in  reducing  the  number  of  operations  is 
decomposing  the  sequence  length  into  a  number  of  smaller 
discrete  transforms.  The  two  methods  of  decomposing  the 
sequence  length  are  the  decimation-in-time ,  where  the  time 
sequence  x(n)  is  decomposed,  and  the  decimation-in¬ 

frequency,  where  the  frequency  sequence  X(k)  is  decomposed. 
These  algorithms  with  reduced  operations  counts  have  the 
number  of  real  computations  roughly  proportional  to  Nlog2N 
and  are  generally  known  as  fast  Fourier  transforms  (FFTs). 
Radix-2 

The  first  algorithm  evaluated  is  the  in-place 
decimation-in-time  radix-2  algorithm  developed  by  Rabiner 
and  Gold  (Rabiner  and  Gold,  1975),  which  is  similar  to  the 
one  developed  by  Cooley  and  Tukey  (Cooley  and  Tukey,  1965), 
with  a  permutation,  or  bit  reversal,  performed  on  the  input 


sequence.  For  an  N  length  sequence,  the  algorithm  requires 
(N/2)log2N  complex  multiplications  and  Nlog2N  complex 
additions.  The  bit  reversal  interchanges  are  performed 
(N-2**[m/2])/2  times,  where  [  ]  means  the  smallest  integer 
greater  than  or  equal  to  the  quantity  enclosed  and  m  equals 
log2N.  The  number  of  times  each  section  of  the  radix-2 
algorithm  is  executed  is  listed  in  Table  3.1  (Route,  1981). 
These  equations  will  be  used  in  determining  the  number  of 
times  various  assembly  language  instructions  are  executed. 
Singleton  * s  Mixed  Radix 

The  second  algorithm  evaluated  is  the  Singleton  mixed- 
radix  algorithm  (MFFT)  (Singleton,  1979),  which  is  a 
decimation-in-frequency  algorithm  since  it  decomposes  the 
frequency  index  instead  of  the  time  index.  The  algorithm 
decomposes  the  sequence  length  into  square  factors,  odd 
prime  integers,  and  square  free  factors  of  prime  integers. 
Then,  the  appropriate  short  transforms  are  performed.  The 
algorithm  requires  N(p ^pgt. . ,+pm-m)+N(m-l)  complex 
multiplications  and  N  (  p  j+p  . . . +p  m-m )  complex  additions, 
where  pi  is  the  ith  factor  of  the  sequence  length  and  m  is 
the  number  of  factors  (Oppenheim  and  Schafer,  1975).  The 
number  of  times  each  section  is  executed  is  listed  in  Table 
3.2  (Route,  1981). 

The  Singleton  mixed  radix  algorithm  is  based  on  the 
method  proposed  by  Cooley  and  Tukey  (Cooley  and  Tukey, 
1965).  The  sequence  length  N  is  factored  into  m  different 


prime  factors  as 


TABLE  3.2 

Number  of  Executions  -  Mixed  Radix  Program 


Program  Section 

Number  of  Times 
Executed 

Initialize  Difference 

Equations  Constants 

m 

Radix-2  Butterfly 

Without  Twiddle  Factor 

Radix-2  Butterfly 

With  Twiddle  Factor 

(N/ni>  (n^-D-n^n^.  •  .n^  ^(n^l) 

Radix-2  Twiddle  Factors 
Calculated  by  Library 

Calls 

TWCALi 

32 

where  TWCAL^  * 

(KSPAN  -l)-(KSPANi-1)  -  1 

2 

Radix-2  Twiddle  Factors 
Calculated  by  Difference 
Equations 

TWCAL 

TWCALi  - 

Radix-4  Butterfly 

With  Twiddle  Factors 

(  (N/ni>  (n^-D-n^n^ . .  .n^_^  (n^l)  )/3 

Radlx-4  Butterfly 

Without  Twiddle  Factors 

(n1n2  . .  •ni_1(ni-l)  )/3 

Radix-4  Twiddle  Factors 
Calculated  by  Library 

Calls 

(KSPAN^-1) 

32 

Radix-4  Twiddle  Factors 
Calculated  by  Difference 
Equations 

(KSPAN  -1) 

(KSPAN  -1)  -  1 

i  32 

Odd  Transform  Input 
Coefficients  Additions 

(  (p-1)  /  2)  •  (N/p) 

Odd  Transform  Cosine 
and  Sine  Multiplications 

(  (p-1)  /  2)2  •  (N/p) 

Odd  Transform  Resultant 
Coefficient  Additions 

(  (p-1)  /  2)  •  (N/p) 
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TABLE  3.2  CONTINUED 


Program  Section 

Number  of  Times 
Executed 

Odd  Factor  Twiddle 

Factor  Multiplications 

(N/ni>  (n^-1)  -n^..  .u4_1  n^j^Cn  -1) 

Odd  Factor  Unique 

Twiddle  Factor 

Calculations 

(KSPAN^l)  (n^l) 

Odd  Factor  Twiddle  Factors 
Calculated  by  Library  Calls 

(KSPAN^-2) 

32 

Odd  Factor  Twiddle  Factors 
Calculated  by  Difference 
Equations 

(KSPAN  -2)  -  «SPA\-2) 

1  32 

Square-Factor  Pairwise 
Interchanges 

(N-n  n2n3n^n5>  /  2 

Digit-Reversal  of 

Square-Free  Indices 

(n3n4n5'1) 

Temporary  Store  of  Initial 
Square-Free  Subsequence 
Coefficients 

(nin2)2  •  PC 

Reordering  of  Square-Free 
Subsequence  Coefficients 

2 

(nin2)  (n^n^n^-PC-S) 

(Route,  1981) 

m  *  Number  of  Factors 

N  *  n,n-n— • .  .n  _  n 
1  l  J  m-1  m 

KSPANi  «  ^nin2n3’  *  'ni 


■  Factor  Currently  Being  Transformed 
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(3-5) 


m 

N"  H  n  , 
i-1  1 


The  transform  is  then  decomposed  into  m  steps  with  N/n^ 
transformations  of  size  n^.  For  the  radix-2  algorithm,  n  ^*2 
for  all  values  of  i,  but  that  need  not  be  the  case.  The 
transform  is  computed  in  place,  and  a  permutation  is 
performed  to  rearrange  the  results  in  order. 

Singleton  proposed  that  the  transform  be  expressed  as  a 
matrix  multiplication 

X-Tx  (3-6) 
where  x  is  the  input  sequence  as  a  vector,  X  is  the  output 
sequence  as  a  vector,  and  T  is  an  n  by  n  matrix  of  complex 
exponentials  of  the  form 


tkl»exp(-j2Tikl/N)  (3-7) 

The  matrix  T  can  then  be  factored  into 

T"PFmFm-r'*F2Fl  (3"8) 

where  Fi  is  the  transform  step  corresponding  to  the  factor 

ni  of  N  and  P  is  the  permutation  matrix.  Then,  each  F  ± 

matrix  can  be  partitioned  into  N/n  i  square  submatrices  of 

dimension  N.  This  is  the  basis  for  Singleton's  mixed  radix 

FFT  algorithm.  In  addition,  we  have  F  j-R  jT  where  R  j_  is  a 

diagonal  matrix  of  rotation,  or  twiddle,  factors.  When  the 

rotation  factors  are  separated  into  a  matrix  in  this  way, 

the  trigonometric  identities  for  the  exponential  form  can  be 

used  to  simplify  the  computations.  For  example,  a 

multiplication  by  exp(-j2ir)  can  be  eliminated  since 

exp(-j2ir)*l .  The  final  matrix  expression  becomes 
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X-PR  T  R  .T  . .  .  R  „T  R  T  ,x  (3-9) 

mmm-lm-1  2211 

This  equation  is  implemented  in  the  FORTRAN  code,  a  listing 
of  which  is  contained  in  Appendix  B. 

Winograd  Algorithm 

The  Winograd  algorithm  (WFTA)  (McClellan  and  Nawab, 
1979)  is  based  on  circular  convolution,  with  the  circular 
convolution  of  two  sequences  being  equivalent  to  determining 
the  N  coefficients  of  a  linear  polynomial.  The  sequence  is 
decomposed  into  a  series  of  short  transforms  by  the  Chinese 
Remainder  Theorem,  with  the  multiplications  nested  inside 
the  input  and  output  additions.  The  number  of  times  each 
section  is  executed  is  listed  in  Table  3.3  (Route,  1981). 

In  the  Winograd  algorithm,  the  number  of 
multiplications  is  reduced  while  the  number  of  additions 
remains  at  the  typical  FFT  level  of  Nlog^  (Silverman, 
1977).  Like  the  Singleton  mixed  radix  algorithm,  the 
Winograd  algorithm  uses  a  matrix  equation 


x-v 


(3-10) 


where  x  and  X  are  column  vectors  of  the  input  and  output 
values  respectively  and  is  an  N  by  N  matrix  with  elements 


DN(t,r)-W.,lr-W;1(lr)'no« 


(3-11) 


The  matrix  can  be  decomposed  into  the  form 

(3-12) 

where  T^  is  a  J  by  N  incidence  matrix,  C«j  is  a  J  by  J 
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diagonal  matrix,  and  is  a  N  by  J  incidence  matrix.  The 

problem  is  to  determine  a  solution  to  (3-12)  where  J  is  much 

2 

smaller  than  N  .  Winograd  used  field  theory  to  show  that 
for  J  approximately  equal  to  N,  the  number  of 
multiplications  is  on  the  order  of  N.  However,  the  number 
of  additions  is  on  the  order  of  Nlog  N,  which  is  the  same  as 
a  typical  FFT.  Winograd  introduced  a  method  of  breaking  a  N 
length  DFT  into  a  nested  series  of  smaller  length  DFTs . 
With  this  method,  the  number  of  multiplications  does  not 
grow  as  quickly  with  the  sequence  length  N  as  in  a  typical 
FFT.  However,  this  method  does  involve  the  reordering  of 
the  data  both  before  and  after  processing.  This  reordering 
can  be  easily  accomplished  if  N  is  factorable  into  mutually 
prime  factors  as  described  by  the  Chinese  Remainder  Theorem. 
The  Chinese  Remainder  Theorem  states  that  if  an  integer 
modulus  is  factored  into  two  (or  more)  relatively  prime 
factors  N=N1  N2  and  the  residues  of  a  smaller  number  n  are 
evaluated  modulus  these  two  factors,  i.  e., 

nls*<n>N  an<*  "  2s<n>  N  (3-13) 

then  the  original  number  can  be  reconstructed  by  the  formula 

n»<K  xn  j+K  jn  2>  n  (3-14) 

for  suitably  chosen 

Using  the  short  transforms,  the  matrix  equation  becomes 

X'-(D  *D„  *...*D„)x’  (3-15) 

-N1  ,N1-1  1 

Substituting  equation  (3-12)  into  (3-13)  gives 
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(3-16) 


X'-(S„  C„T„*S„  C 
•'l  1  Jl-l 


N 


1-1  1-1 


•*SnCnTm  )x' 
1  1  1 


Using  the  property  of  Kronecker  products  (Thrall  and 

Tornheira,  1957),  we  get  the  fundamental  result  of  the  WFTA. 

x»-(s:ii*sNi_*. .  .*srii)(cM  .  •*cn1)(tn1*t:i1.1*‘  •  -*T 

(3-17) 

or  equivalently 

X'.S:IC,,V  (3-l8) 

The  total  number  of  multiplications  is  the  product  of  the 
number  of  multiplications  for  the  individual  small  N 
algorithms.  In  other  words, 

M-M  Mv  . . . M  (3-19) 

1  1-1  1 

Since  the  diagonal  elements  of  C  are  either  strictly  real 
or  strictly  imaginary,  all  multiplications  can  be  performed 
using  scalars.  A  listing  of  the  program  is  included  in 
Appendix  C.  The  difference  between  the  WFTA1  and  WFTA2 
algorithms  used  in  this  study  is  that  the  WFTA1  includes 

code  to  initialize  some  of  the  matrices  for  a  particular 

sequence  length. 

Prime  Factor  Algorithm 

The  final  algorithm  evaluated  is  the  prime  factor 
algorithm  (PFA)  (Burrus  and  Eschenbacher ,  1981),  which  is 

also  based  on  circular  convolution  and  uses  the  Chinese 
Remainder  Theorem  to  decompose  the  sequence  into  short 
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transforms.  However,  the  prime  factor  algorithm  is  more 
similar  to  the  radix-2  and  MFFT  than  to  the  WFTA.  Unlike 
the  MFFT  and  WF_A,  the  version  of  the  PFA  evaluated  does  not 
decompose  the  sequence  length  N,  i.  e.,  the  factors  of  N 
must  be  specified.  The  algorithm  can  still  be  compared  to 
the  other  three  since  the  decomposition  accounts  for  less 
than  1%  of  the  execution  time  (Route,  1981).  After  the 
transform  is  complete,  the  output  sequence  is  permuted,  or 
unscrambled,  into  the  proper  order.  The  number  of  times 
each  section  is  executed  is  listed  in  Table  3.4  (Route, 
1981).  The  permutation  is  required  because  the  PFA  is  an 
in-place  algorithm.  The  reordering  or  unscrambling  is  done 
by  the  use  of  a  generalization  of  the  Chinese  Remainder 
Theorem.  The  PFA  is  a  completely  new  algorithm  with  no 
counterpart  in  the  Cooley-Tukey  approach. 

Like  all  other  FFTs,  the  PFA  starts  with  the  basic 
equation 


N-l  . 
c(k)«  Ex(n)WM 
n«0 

Then,  the  following  change  of  variables  is  made. 

n»<Kjn  i+Kgng^ 

k*<K3kl+K4k2>N 


(3-20) 


(3-21) 

(3-22) 


and 


x(n1,n2)-x(<K1n1+K2n2>N)  (3-23) 

c(k1,k2)-c(<K3k1+K4k2>,<)  (3-24) 
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factor  being  transformed. 


where  <  >  means  the  quantity  enclosed  is  evaluated  modulo 

N.  Thus,  substituting  these  into  equation  (3-18)  gives 

^  N  -1  N  -1  K. K7n. k.  K.K.n.k,  K0K_n0k. 

c(k,,k,)-  E  E  i(n,,n,)W„  1  3  1  Hi  1  4  1  V.  2  3  2  1 
1  2  n  -0  n  -0  121  !  * 

K2K4n2k2 
W'I 


(3-25) 


Now,  if  the  l^s  are  chosen  so  that 


<K1K3>N“N2 


<K2K4>N-N1 


<KlK4>N“<K2K3>N-° 


(3-26) 

(3-27) 

(3-28) 


equation  (3-23)  reduces  to 

-  N  -1  N  -1  n  k  n  k 

c(k  ,k  )-  Z  E  x(n  ,n  )W  1  !W  (3-29) 

1  2  n^O  n2-0  1  2  *1  J2 

which  is  the  basic  form  of  the  PFA.  A  listing  of  the 
program  implementing  this  algorithm  is  included  in  Appendix 
D.  The  unscrambling  constant  used  to  permute  the  sequence 
into  the  proper  order  is  determined  from 

UNSO<  E(N/ni)>.j  (3-30) 

where  n  are  the  factors  of  the  sequence  length  N. 

Now  that  each  of  the  algorithms  executed  have  been 
analyzed,  the  computer  architectures  on  which  they  were 
executed  will  be  explored. 
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IV.  Computer  Architectures 

Before  describing  the  architecture  of  each  computer, 
the  word  architecture  needs  to  be  defined.  A  common 
definition  of  computer  architecture  is  everything  an 
assembly  language  programmer  needs  to  know  to  write  a 
program  that  will  run.  Thus,  the  available  instructions  and 
the  number  of  registers  would  be  part  of  the  architecture, 
while  data  path  widths  and  cache  memory  would  not.  However, 
for  the  purposes  of  this  study,  this  definition  is  not 
sufficient.  This  study  defines  computer  architecture  as 
everything  a  FORTRAN  programmer  needs  to  know  to  write  an 
FFT  program  that  runs  as  fast  as  possible.  With  this 
definition,  items  such  as  cache  memory  and  even  the  compiler 
are  considered  part  of  the  computer  architecture.  This  view 
of  architecture  encompasses  the  entire  computer  system.  The 
main  features  of  each  computer  architecture  are  listed  in 
Table  4.1. 

Cray-1 . 

The  Cray-1  is  a  high  speed  scientific  computer.  It  has 
8  24-bit  address  registers  and  8  64-bit  scalar  registers,  as 
well  as  8  64-element  vector  registers  of  64  bits  each.  The 
CPU  also  contains  a  high  speed  buffer  of  64  operand 
registers  and  64  address  registers,  and  four  instruction 
buffers  each  containing  64  registers.  The  Cray-1  CPU  has 
separate  pipelined  functional  units  for  scalar  addition, 
floating  addition,  and  floating  multiplication,  each  of 
which  can  operate  in  parallel  with  the  others.  The  Cray-1 
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real  time  clock  used  for  algorithm  timing  has  a  resolution 
of  12.5  nanoseconds.  A  block  diagram  of  the  CPU  is  shown  in 
Figure  4.1  (Cray  Research,  2240004). 

The  FORTRAN  source  code  was  compiled  using  the  CFT  1.09 
compiler,  which  implements  the  FORTRAN  77  language.  The 
programs  were  executed  under  the  COS  1.09  operating  system. 
The  Cray-1  instruction  set  includes  both  vector  and  scalar 
operations.  Table  4.2  contains  the  Cray-1  instruction 
timings . 

CPC  Cyber  750 . 

The  CDC  Cyber  750  is  a  high  speed  scientific  computer. 
It  has  8  operand  registers,  8  address  registers,  and  8  index 
registers.  Each  operand  register  has  a  corresponding 
address  and  index  register.  Six  of  the  register  sets  read 
from  memory,  while  two  write  into  memory.  Placing  an 
address  into  an  address  register  causes  the  computer  to 
initiate  the  desired  fetch  or  store.  The  CPU  has  separate 
pipelined  functional  units  for  integer  addition,  floating 
addition,  floating  multiplication,  and  floating  division 
that  can  operate  in  parallel.  An  instruction  stack  of  12 
registers  provides  high  speed  buffer  storage.  According  to 
CDC,  the  Cyber  750  real  time  clock  has  a  resolution  of  10 
milliseconds.  However,  testing  done  by  Route  found  the 
clock  to  be  accurate  to  1  millisecond.  A  block  diagram  of 
the  system  is  in  Figure  4.2  (Control  Data  Corporation, 
1979). 

The  FORTRAN  source  code  was  compiled  using  the  FTN  4.8 
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compiler  with  option  2,  which  optimizes  the  execution  time. 
The  CDC  compiler  has  three  levels  of  optimization  to  improve 
the  execution  speed  of  the  algorithm  (Control  Data 
Corporation,  1982).  The  compiler  recognizes  and  replaces 
common  expressions,  removes  invariant  computations  from 
within  loops,  retains  frequently  referenced  variables  in 
registers,  and  assigns  frequently  referenced  variables  to 
registers  across  loops.  It  also  evaluates  constant 
expressions  at  compile  time,  and  simplifies  subscript 
calculations  by  using  additions  instead  of  multiplications 
where  possible.  The  compiler  may  reorder  some  of  the 
operations  to  minimize  the  idle  time  of  the  functional 
units.  The  programs  were  run  under  the  NOS/BE  operating 
system.  Table  4.3  lists  the  Cyber  750  instruction  timings. 
IBM  370/155. 

The  IBM  370/155  is  a  general  purpose  mainframe 
computer.  It  has  16  integer  and  address  registers  and  4 
floating  point  accummulators .  The  CPU  can  pre-fetch  up  to 
three  instructions,  and  the  fetch  and  execution  cycles  can 
be  overlapped.  High  speed  buffer  storage  of  8  kbytes 
decreases  the  average  data  transfer  time.  The  IBM  370/155 
real  time  clock  has  a  resolution  of  3.33  milliseconds. 
A  block  diagram  of  the  system  is  in  Figure  4.3  (IBM,  1972). 

The  source  code  was  compiled  using  the  FORTRAN  H  level 
21.8  compiler  with  option  2,  which  gives  the  most  optimized 
object  code.  The  IBM  compiler  has  three  levels  of 
optimization  to  improve  the  execution  speed  of  the 
algorithms  (IBM,  1974).  The  compiler  recognizes  and 
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(Control  Data  Corporation,  1979) 


Fig  4.3.  IBM  370/155  Architecture 


replaces  common  expressions,  removes  invariant  computations 
from  within  loops,  retains  frequently  referenced  variables 
in  registers,  and  assigns  frequently  referenced  variables  to 
registers  across  loops.  It  also  simplifies  subscript 
calculations  by  using  additions  instead  of  multiplications 
where  possible.  The  programs  were  run  under  the  OS/MVT 
version  21.8  F  operating  system.  Table  4.4  lists  the  IBM 
instruction  timings. 

DEC  VAX  11/780. 

The  DEC  VAX  11/780  is  a  32  bit  super  minicomputer.  The 
computer  was  designed  as  an  extension  and  improvement  over 
DEC's  PDP  11  series.  It  overcomes  the  main  drawback  of  the 
PDP  11  series,  namely,  the  program  cannot  be  longer  than  64 
kbytes.  In  fact,  the  name  VAX  comes  from  the  words  virtual 
address  extension.  The  VAX  11/780  is  near  the  upper  end  of 
the  performance  scale  of  the  VAX  series  of  computers. 

The  VAX  11/780  has  16  32  bit  registers  (Digital 
Equipment  Corporation,  1982).  However,  four  of  the 
registers  are  used  by  the  operating  system  as  a  program 
counter,  stack  pointer,  argument  pointer,  and  frame  pointer. 
The  CPU  contains  three  types  of  high  speed  buffer  storage: 
a  memory  cache,  an  address  translation  buffer,  and  an 
instruction  buffer.  The  memory  cache  contains  8  kbytes,  is 
direct  mapped,  and  is  designed  to  achieve  a  cache  hit  ratio 
of  95X.  When  a  cache  miss  occurs,  8  bytes  are  read  from 
memory  into  the  cache.  The  address  translation  buffer 
contains  128  of  the  most  frequently  used  physical  to  virtual 
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TABLE  4.4 


IBM  370/155  Instruction  Timings 


Instruction 

Execution  Time 
in  Microseconds 

Load  Register 

LA 

0.499 

LCER 

0.959 

LER 

0.844 

LR 

0.384 

Load  from  Memory 

L 

0.648 

LE 

0.993 

Store  to  Memory 

1  i 

ST 

0.604  (2 . 104) | 

STE 

1.074  (2.344)  | 

Add  Integer 

1 

1 

A 

0.993 

AR 

0.499 

Add  Floating  Point 

AE 

2.749 

AER 

2.255 

Multiply  Floating  Point 

ME 

7.387 

MER 

6.778 

Divide  Floating  Point 

DE 

8.986 

DER 

8.952 

Compare 

C 

0.763 

CR 

0.384 

Branch 

BC 

0.844  CO. 499), 

BCR 

0.729  (0.384), 

BXLE 

1.304  (0.959) 

*If  the  previous  instruction  was  a  store  instruction,  the 

execution  time  in  parenthesis 

is  used. 

2 

If  the  branch  conditions  fail. 

the  execution  time  in 

parenthesis  is  used. 

(IBM,  1972) 
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address  translations.  The  instruction  prefetch  buffer  can 
hold  8  bytes  and  is  always  kept  full  by  the  CPU  control 
logic.  The  VAX  11/780  used  in  this  study  contained  the 
optional  high  performance  floating  point  accelerator.  The 
floating  point  accelerator  is  used  on  all  floating  point 
operations  and  can  also  be  used  on  32  bit  integer  multiply 
instructions.  With  the  floating  point  accelerator,  a 
floating  add  register  to  register  can  take  as  little  as  800 
nsec,  while  a  floating  multiply  register  to  register  can 
take  as  little  as  1  usee.  The  CPU  is  microprogrammed, 
supports  32  interrupt  priority  levels,  and  has  both  a 
realtime  clock  and  a  time  of  year  clock.  The  memory  uses 
MOS  technology  and  is  expandable  in  256  kbyte  increments. 
An  LSI  11  microcomputer  is  used  as  a  console  subsystem  and 
for  initial  loading  of  the  operating  system.  The  VAX  11/780 
real  time  clock  has  a  resolution  of  16.7  milliseconds.  A 
detailed  block  diagram  of  the  CPU  is  shown  in  Figure  4.4. 
Figure  4.5  contains  the  overall  system  block  diagram. 

The  VAX  11/780  was  designed  for  a  multiprogramming 
environment  with  the  inclusion  of  features  such  as  rapid 
context  switching,  priority  dispatching,  virtual  addressing, 
and  memory  management  (Digital  Equipment  Corporation, 
1981).  The  VAX  11/780  has  a  virtual  address  space  of  4 
Gbytes.  However,  half  of  this  memory  space  is  reserved  for 
the  operating  system  and  overhead,  leaving  a  user  virtual 
address  space  of  2  Gbytes.  The  VAX  supports  five  different 
integer  data  types:  byte,  word,  longword,  quadword,  and 
octaword.  In  addition,  there  are  four  different  floating 


IV- 12 


Console 


Figure  4.4.  Expanded  Block  Diagram  of  DEC  VAX  11/780  CPU 


Figure  4.5.  DEC  VAX  11/780  System  Block  Diagram 


point  formats,  each  with  varying  degrees  of  precision.  The 
VAX  also  has  several  other  data  types,  such  as  packed 
decimal  and  character  string.  Like  the  PDP  11,  the  VAX  has 
seven  different  operand  addressing  modes.  The  similarity  of 
the  VAX  to  the  PDP  11  is  deliberate.  The  VAX  was  designed 
so  that  PDP  11  programs  could  be  executed  on  the  VAX  if 
desired. 

The  operating  system  under  which  the  FFT  programs  were 
run  is  the  Berkeley  UNIX  version  4.1  multi-user  system 
(Kernighan,  1978).  The  operating  system  was  written  in  the 
C  programming  language,  and  is  designed  to  work  best  with 
the  C  language,  but  can  be  used  with  FORTRAN.  The  FFT 
programs  were  compiled  using  the  FORTRAN  77  compiler  f 7 7  and 
the  optional  C  language  optimizer.  The  FORTRAN  source  code 
is  first  converted  to  C,  optimized,  and  then  compiled  into 
machine  code.  The  compiler  also  has  an  option  for 
generating  an  assembly  language  listing.  The  instruction 
timings  for  the  VAX  are  considered  company  confidential 
information  and  would  not  be  released  by  DEC  (Dixon,  1983). 
An  attempt  was  made  to  determine  the  instruction  timings 
experimentally  by  solving  the  system  of  equations  determined 
from  the  results  given  in  Chapter  5.  However,  since  the  VAX 
uses  instruction  overlap  and  a  cache  memory,  the  system  is 
not  linear,  and  no  valid  results  could  be  obtained. 

DEC  PDP  11/60. 


The  DEC  PDP  11/60  is  a  general  purpose  minicomputer. 
It  has  8  integer  registers,  but  only  six  are  useable  by  the 


program  for  operand  storage,  with  the  other  two  used  as  a 
program  counter  and  stack  pointer.  The  computer  has  the 
optional  floating  point  processor  for  performing  floating 
operat*  ons  which  can  operate  in  parallel  with  the  main  CPU. 
The  floating  point  processor  contains  6  floating  point 
accummulators ,  of  which  only  4  are  memory  accessable.  The 
CPU  contains  2  kbytes  of  high  speed  cache  memory,  and  uses 
the  DEC  Unibus  to  transfer  data  in  and  out  of  memory.  The 
PDP  11/60  real  time  clock  has  a  resolution  of  16.7 
milliseconds.  A  block  diagram  is  in  Figure  4.6  (Digital 
Equipment  Corporation,  1979). 

The  source  code  was  compiled  using  the  F4P  version  3.0 
compiler.  The  DEC  compiler  has  a  fixed  level  of 
optimization  (Digital  Equipment  Corporation,  1981).  The 
compiler  recognizes  and  replaces  common  expressions,  removes 
invariant  computations  from  within  loops,  retains  frequently 
referenced  variables  in  registers,  and  assigns  frequently 
referenced  variables  to  registers  across  loops.  It  also 
evaluates  constant  expressions  at  compilation  time.  The 
programs  were  run  under  the  DEC  RSX-11M  version  3.2 
multiuser  operating  system.  Table  4.5  lists  the  instruction 
timings . 

DEC  PDP  11/50. 

The  DEC  PDP  11/50  is  a  general  purpose  minicomputer. 
For  the  purposes  of  this  study,  the  PDP  11/50  is  identical 
to  the  PDP  11/60  except  that  the  PDP  11/50  has  no  cache 
memory  and  it  has  a  slower  version  of  the  optional  floating 
point  processor.  The  PDP  11/50  real  time  clock  has  a 
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PDP  11/60  Architecture 


resolution  of  16.7  milliseconds.  A  block  diagram  of  the 
architecture  is  shown  in  Figure  4.7  (Digital  Equipment 
Corporation,  1976). 

The  source  code  was  compiled  using  the  F4P  version  2.4 
compiler.  The  compiler  has  a  fixed  level  of  optimization 
(Digital  Equipment  Corporation,  1981).  It  recognizes  and 
replaces  common  expressions,  removes  invariant  computations 
from  within  loops,  retains  frequently  referenced  variables 
in  registers,  and  assigns  frequently  referenced  variables  to 
registers  across  loops.  It  also  evaluates  constant 
expressions  at  compile  time.  The  programs  were  run  under 
the  RSX-11M  version  3.1  multiuser  operating  system.  Table 
4.6  lists  the  instruction  timings. 

Cromemco  Z-2D. 

The  Cromemco  Z-2D  is  a  general  purpose  microcomputer. 
It  uses  a  Z-80  microprocessor  with  a  4  Megahertz  clock  as 
its  CPU.  The  Z-80  contains  12  8-bit  registers  which  can  be 
combined  into  6  16-bit  registers.  The  CPU  cannot  perform 
multiplication  or  division  and  must  rely  on  software 
routines  for  these  functions.  Memory  is  accessed  through  an 
S-100  bus.  The  Cromemco  Z-2D  has  no  real  time  clock.  A 
block  diagram  of  the  architecture  is  shown  in  Figure  4.8 
(Zilog,  1977). 

The  source  code  was  compiled  using  the  version  3.21 
compiler  supplied  by  Cromemco.  The  programs  were  run  under 
the  CDOS  version  2.36  operating  system,  copyright  1980  by 
Cromemco,  Inc.  Table  4.7  lists  the  CPU  instruction  timings. 
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Figure  4.7.  DEC  PDF  11/50  Architecture 
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Figure  4.8.  Cromemco  Z-2D  Architecture 


The  instruction  timings  for  the  floating  point  (real) 
operations  were  determined  by  executing  a  loop  containing 


each  instruction.  Thus,  the  values  for  the  floating  point 
instructions  were  derived  experimentally,  while  the  others 
were  provided  by  the  manufacturer. 

Now  that  an  analysis  of  the  algorithms  has  been 
presented,  and  each  of  the  computer  architectures  has  been 
described,  the  next  chapter  will  present  the  results  of 
executing  the  algorithms  on  the  computer  architectures. 
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V.  Results 


Each  of  the  algorithms  was  run  for  different  sequence 
lengths  on  seven  computers:  a  Cray-1,  a  CDC  Cyber  750,  an 
IBM  370/155,  a  DEC  VAX  11/780,  a  DEC  PDP  11/60,  a  DEC  PDP 
11/50,  and  a  Cromemco  Z-2D.  The  radix-2  algorithm  was 
executed  for  sequence  lengths  of  512,  1024,  and  2048,  while 
the  other  three  were  executed  for  sequence  lengths  of  504, 
630,  1008,  1260,  and  2520. 

Methodology 

The  time  required  to  perform  a  transform  using  each 
algorithm  was  determined  for  each  sequence  length.  The 
times  are  average  values  obtained  by  repeated  execution  of 
the  algorithm.  The  execution  time  measured  was  for  the 
transform  algorithm  only  and  does  not  include  the  time  to 
digitize  the  input  data  or  to  print  results.  The  WFTA1 
values  include  the  time  required  for  initialization,  while 
the  WFTA2  values  do  not.  For  all  computers  except  the 
Cromemco  Z-2D,  the  time  was  measured  using  the  system 
realtime  clock.  The  clock  was  initialized  to  zero  at  the 
beginning  of  the  algorithm,  and  read  at  the  end  of  the 
algorithm.  The  Cromemco  Z-2D  does  not  have  a  realtime 
clock,  so  the  execution  speed  was  measured  with  a  stopwatch. 
Messages  were  printed  on  the  screen  at  the  beginning  and  end 
of  the  algorithms,  and  the  time  between  these  messages  was 
measured.  Repeated  trials  show  this  method  to  be  accurate 
to  within  ±1  second,  which  is  sufficiently  accurate 
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considering  the  execution  times  of  the  algorithms  on  the 
Cromeroco . 

The  memory  requirements  and  data  array  sizes  have  been 
determined  in  other  studies  and  the  results  are  available  in 
the  literature  (Burrus  and  Eschenbacher ,  1981;  Silverman, 
1977;  Kolba  and  Parks,  1977;  Morris,  1978,  Blanken  and 
Rustan,  1982). 

For  all  computers,  an  assembly  language  listing  of  the 
FORTRAN  source  code  was  obtained  and  analyzed.  The  number 
of  instructions  executed  was  determined  for  each  of  the 
different  instruction  types.  These  results  were  obtained  by 
dividing  the  assembly  language  into  the  sections  listed  in 
Tables  3.1  through  3.4.  Then  the  number  of  each  type  of 
instruction  in  that  section  was  counted.  The  total  number 
of  that  particualr  instruction  which  was  executed  was 
obtained  by  multiplying  the  number  of  instructions  for  each 
section  by  the  number  of  times  each  section  was  executed  and 
then  adding  the  results  of  each  section.  Typically,  over 
99.52  of  the  floating  raultiply/divide  instructions  were 
multiplies,  while  over  90X  of  the  integer  operations  were 
additions  or  subtractions.  The  number  of  floating 
operations  are  dependent  on  the  FFT  algorithm,  and  are 
approximately  equal  to  the  number  predicted  by  the  equations 
given  in  the  earlier  section  of  this  study,  while  the  number 
of  integer  operations  and  data  transfers  are  dependent  on 
the  compiler  and  the  computer  architecture,  and  are 
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different  for  each  computer.  However,  all  are  dependent  on 
the  sequence  length. 

The  percentage  of  time  taken  by  each  instruction 
category  was  determined  by  multiplying  the  number  of  each 
type  of  instruction  by  the  effective  instruction  time. 
These  individual  results  were  then  summed  to  find  the  total 
time.  This  time  is  not  representative  of  the  actual 
measured  time  because  instruction  overlap  was  assumed  to  be 
zero  and  some  types  of  instructions  (i.e.,  increments  and 
decrements)  were  neglected.  However,  these  values  can  be 
used  for  rough  comparisons  between  algorithms  and 
architectures.  The  relationship  between  the  instruction 
counts,  the  execution  speeds,  and  the  computer  architecture 
will  be  analyzed  in  the  next  chapter. 

For  each  of  the  architectures,  the  correlation 
coefficients  between  the  execution  times  and  the  instruction 
counts  are  determined.  The  correlation  coefficients  are 


calculated  from  the  equation 

,  £  •*  Xt  -j)  _ 

P  '  ====r^r-^:-_-5-  ’  ( 5-1 ) 

-  i) 

where  x^  is  the  execution  speed  for  a  particular  sequence 
length,  y^  is  the  number  of  instructions  for  that 
instruction  category  and  sequence  length,  ~x  is  the  average 
execution  time,  y  is  the  average  instruction  count,  and  N  is 
the  number  of  samples.  The  correlation  coefficient  is  used 
to  illustrate  the  relationship  between  two  variables.  It 
provides  a  method  for  evaluating  the  extent  to  which 
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variations  in  one  variable  are  associated  with  variations  in 
another  variable  and  permits  an  appraisal  of  the  closeness 
of  the  cause  and  effect  relationship.  A  correlation 
coefficient  can  vary  from  -1  to  +1.  The  closer  the 
correlation  coefficient  is  to  +1,  the  greater  the  cause  and 
effect  relationship  is  between  the  variables.  But  in  order 
for  a  correlation  coefficient  to  be  meaningful,  the  sample 
size  must  be  large  enough  to  eliminate  chance  effects.  For 
most  of  the  correlation  coefficients  calculated  for  this 
study,  the  sample  size  is  twenty-three,  which  consists  of 


three  from 

the 

three  sequence 

lengths 

of 

the  radix-2 

algorithm , 

five 

from  the  MFFT, 

five  from 

the 

WFTA1 ,  five 

from  the  WFTA2, 

and  five  from 

the  PFA. 

A 

correlation 

coefficient 

was 

calculated  between  the  execution  time  and 

each  of  the  following  instruction  categories:  data 
transfers,  floating  additions  and  subtractions,  floating 
multiplications  and  divisions,  and  integer  operations.  For 
example,  when  calculating  the  correlation  coefficient 
between  the  execution  times  and  the  data  transfers,  the 
execution  speeds  are  the  x  terms,  while  the  number  of  data 
transfers  are  the  y  terras.  The  correlation  coefficients 
could  indicate  which  instruction  category  is  most  closely 
related  to  the  execution  time. 
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Table  5.1  lists  the  execution  speeds  of  each  of  the 
algorithms  and  sequence  lengths  on  the  Cray-1.  Using  a 
clock  resolution  of  12.5  nanoseconds  and  the  minimum 
execution  time  of  2.73  milliseconds,  the  maximum  percentage 
error  is  0.5%.  The  correlation  coefficients  between  the 
execution  speeds  and  the  four  major  instruction  categories 
are: 


floating  multiply/divide 

0.8848 

floating  add/subtract 

0.9336 

integer  operations 

0.9700 

data  transfers 

0.9480. 

Tables  5.2  through  5.6  list  the  number  of  instructions  in 
each  category  for  each  algorithm  and  sequence  length.  The 
correlation  coefficients  range  from  0.8848  for  the  floating 
multiplications  and  divisions  to  0.9700  for  the  integer 
operations,  with  the  correlation  coefficient  for  the  data 
transfers  being  0.9480.  The  high  correlation  between  the 
execution  speed  and  number  of  data  transfers  is  related  to 
the  instruction  timings.  An  operand  load  requires  137.5 
nsec,  a  store  or  register  transfer  requires  12.5  nsec,  a 
floating  multiply  requires  87.5  nsec,  and  a  floating  add 
requires  75.0  nsec.  The  ratio  of  floating  multiply  speed  to 
data  transfer  speed  is  2.2.  Figure  5.1  shows  the  percentage 
of  execution  time  taken  by  each  of  the  instruction  types. 
The  value  used  for  the  typical  data  transfer  time  was  106.25 
nsec,  which  considers  75X  of  the  data  transfers  to  be  loads, 
with  the  other  25Z  being  stores  and  register  transfers. 


V-5 


w 

■u 

»H 

3 

0} 

a> 

• 

m 

P< 

*H 

w 

< 

►J 

£ 

3 

H 

3 

>0 

00 

CN 

00 

CN 

LA 

vO 

iH 

CN 

sO 

CN 

O 

CO 

vO 

CN 

vO 

Os 

CN 

o\ 

Ov 

O 

fs. 

LA 

CO 

sO 

«H 

i— l 

vO 

CO 

O 

CO 

LA 

00 

LA 

O 

ON 

00 

CO 

o 

O' 

00 

sO 

CO 

r* 

sO 

CO 

-cr 

LA 

ON 

CO 

LA 

00 

<r 

o 

\D 

r%. 

O' 

O 

*H 

CN 

CO 

CO 

o 

<r 

O 

O' 

CO 

CO 

fH 

LA 

CN 

<r 

sO 

nt 

CN 

co 

CO 

oo 

CO 

LA 

o 

»H 

on 

CN 

00 

fH 

o 

iH 

CN 

VO 

rH 

CN 

CN 

00 

o  ; 

CO 

CO  1 

CN 

00 

vO 

vO 

CN 

CN 

<■ 

00 

CO 

O 

<r 

CO 

Os 

00 

LA 

<r 

CN 

Os 

*H 

•H 

fH 

lA 

o 

CN 

CN 

0) 

•o 

•H 

>  4J 

*H  O 

Q  CO 

>>  *J 

*H  JD 

a  s 

•H  C/5 


Figure  5.1.  Percentage  of  Time  per  Instruction  Category  -  Cray-1 


Since  over  99.5%  of  the  integer  operations  were  additions, 
the  value  of  25  nsec  was  used  for  the  integer  operations. 
Instructions  other  than  those  in  the  four  major  categories 
are  neglected.  The  WFTA1  has  the  greatest  percentage  of 
time  taken  by  data  transfers  with  86%,  while  the  MFFT  has 
the  smallest  with  76%.  The  WFTA2  has  the  fewest  floating 
multiplies  and  is  the  fastest  algorithm,  while  the  WFTAl  has 
the  most  floating  multiplies  and  is  the  slowest  algorithm  on 
the  Cray-1.  An  examination  of  the  assembly  language  shows 
that  only  the  WFTA  compiled  using  vector  operations  (Route, 
1981),  thus  the  UFTA2  had  the  shortest  execution  time. 
However,  the  compiler  placed  more  floating  multiplies  and 
data  transfers  in  the  WFTA  initialization  routine  than  any 
other  compiler,  causing  the  WFTAl  to  have  the  longest 
execution  time.  Thus,  the  availability  of  vector  operations 
on  the  Cray-1  benefited  the  WFTA2. 
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CDC  Cyber  750 


Table  5.7  lists  the  execution  speeds  of  each  of  the 
algorithms  and  sequence  lengths  on  the  CDC  Cyber  750.  Using 
a  clock  resolution  of  1  millisecond  and  a  minimum  execution 
time  of  10  milliseconds,  the  maximum  percentage  error  is 
10%.  The  object  code  produced  by  the  optimizing  compiler 
(option  2)  on  the  Cyber  750  is  three  times  faster  than  the 
code  produced  by  the  non-optimizing  compiler  (option  0). 
Throughout  this  study  the  optimized  object  code,  when 
available,  was  compared  for  each  of  the  computers.  The 
correlation  coefficients  between  the  execution  speeds  and 
four  major  instruction  categories  are: 


floating  multiply/divide 

0.7251 

floating  add/subtract 

0.9187 

integer  operations 

0.9569 

data  transfers 

0.9988. 

Tables  5.8  through  5.25  list  the  number  of  instructions  in 
each  category  for  each  algorithm  and  sequence  length.  The 
values  for  the  CDC  Cyber  750  range  from  0.7251  for  the 
floating  point  multiplications  and  divisions  to  0.9988  for 
the  data  transfers.  The  reason  for  the  smaller  correlation 
coefficient  between  the  execution  speed  and  the  floating 
operations  is  the  separate  pipelined  functional  units  of  the 
Cyber  750,  which  increases  the  speed  of  the  floating  point 
operations  but  do  not  affect  the  data  transfer  rates.  Thus 
the  dependency  of  the  execution  speed  on  the  floating 
operations  is  decreased.  The  radix-2  has  the  fewest  data 
transfers  and  is  the  fastest  algorithm  on  the  Cyber  750, 
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even  though  it  does  not  have  the  fewest  floating  operations. 
For  a  sequence  length  of  1008,  the  WFTA1  has  the  most  data 


transfers 

and 

is 

the 

slowest 

algorithm. 

For 

all 

other 

sequence 

lengths 

tested,  the 

MFFT  has 

the 

most 

data 

transfers 

and 

is 

the 

slowest 

algorithm. 

Thus 

the 

data 

transfer  instructions  are  clearly  more  dominant  than  any  of 
the  floating  operations.  The  correlation  coefficient  for 
the  integer  operations  is  large  because  integer  operations 
are  used  to  perform  the  address  calculations  for  data 
transfers,  and  thus  are  closely  related  to  the  number  of 
data  transfers.  The  reason  the  data  transfers  are  so 
important  is  clear  from  the  instruction  timings.  Fetching 
an  operand  from  memory  requires  475  nsec  while  multiplying 
two  floating  point  numbers  requires  only  125  nsec  (Control 
Data  Corporation,  1979).  Figure  5.2  shows  the  percentage 
of  execution  time  taken  by  each  of  the  instruction  types. 
The  radix-2  has  the  smallest  percentage  of  time  taken  by 
data  transfers,  which  is  consistent  with  the  fact  that  it  is 
the  fastest,  while  the  MFFT  has  the  largest  percentage  of 
time  taken  by  data  transfers.  Thus,  the  speeds  of  the 
algorithms  on  the  Cyber  750  are  limited  mostly  by  the  data 
transfer  rate. 
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IBM  370/155 

Table  5.26  lists  the  execution  speeds  of  each  of  the 
algorithms  and  sequence  lengths  on  the  IBM  370/155.  Using  a 
clock  resolution  of  3.33  milliseconds  and  a  minimum 
execution  time  of  103  milliseconds,  the  maximum  percentage 
error  is  3.2%.  The  correlation  coefficients  between  the 
execution  speeds  and  the  four  major  instruction  categories 
are: 


floating  multiply/divide 

0.8659 

floating  add/subtract 

0.9522 

integer  operations 

0.7760 

data  transfers 

0.9507. 

Tables  5.27  through  5.44  list  the  number  of  instructions  in 
each  category  for  each  algorithm  and  sequence  length.  The 
correlation  coefficients  vary  from  0.7760  for  the  integer 
operations  to  0.9522  for  the  floating  add/subtract.  The 
execution  speed  is  most  closely  related  to  the  number  of 
floating  point  additions  and  subtractions,  which  is 
exemplified  by  the  fact  that  the  PFA  is  the  fastest 
algorithm  on  the  IBM  and  has  the  fewest  floating  point 
additions  and  subtractions,  while  the  MFFT  is  the  slowest 
and  has  the  most  floating  point  additions  and  subtractions. 
Since  the  architecture  has  a  multipurpose  functional  unit, 
the  floating  operations  have  a  greater  effect  on  the 
execution  speed,  whereas  the  high  speed  buffer  memory 
decreases  the  dependence  of  the  execution  speed  on  the 
number  of  data  transfers.  Figure  5.3  shows  that  the  WFTA1 
has  the  largest  percentage  of  execution  time  taken  by  data 
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Figure  5.3.  Percentage  of  Time  per  Instruction  Category  -  IBM  370/155 


DEC  VAX  11/780 

Table  5.45  lists  the  execution  speeds  of  each  of  the 
algorithms  and  sequence  lengths  on  the  VAX  11/780.  Using  a 
clock  resolution  of  16.7  milliseconds  and  a  minimum 
execution  time  of  85  miliseconds,  the  maximum  percentage 
error  is  19.6%.  The  correlation  coefficients  between  the 
execution  speeds  and  four  major  instruction  categories  are: 


floating  multiply/divide 

0.6634 

floating  add/subtract 

0.9495 

integer  operations 

0.9244 

data  transfers 

0.9724. 

Tables  5.46  through  5.63  list  the  number  of  instructions  in 
each  category  for  each  algorithm  and  sequence  length.  The 
correlation  coefficients  range  from  0.6634  for  the  floating 
multiplications  and  divisions  to  0.9724  for  the  data 
transfers.  Thus  the  floating  point  accelerator  helped  reduce 
the  dependence  of  the  execution  speed  on  the  floating  point 
operations.  However,  even  though  the  VAX  has  an  8  kbyte 
cache  memory,  the  highest  correlation  occurred  between  the 
data  transfers  and  the  execution  speed.  Since  the 
instruction  times  are  not  available  for  the  VAX,  the 
percentage  of  time  spent  on  each  instruction  category  could 
not  be  determined.  Attempts  were  made  to  determine  the 
instruction  timings  using  the  instruction  counts  and  total 
execution  time  to  form  a  system  of  linear  equations.  But 
solving  this  system  of  equations  resulted  in  some  negative 
instruction  times.  In  order  to  determine  an  approximate 
value  for  the  instruction  times,  subsets  of  the  system  of 
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equations  were  set  up  and  solved.  Again  some  of  the  results 
were  negative  and  the  reasonable  results  were  not 
sufficiently  repeatable  between  different  subsets.  The 
inability  to  solve  the  system  of  equations  is  due  to  the 
fact  that  the  actual  system  is  not  linear.  The  use  of 
instruction  overlap,  a  separate  floating  point  accelerator, 
and  a  cache  memory  introduce  nonlinearities  into  the  system. 
However,  the  data  transfers  would  be  expected  to  take  the 
greatest  percentage  of  time,  given  the  correlation 
coefficients  and  the  results  of  the  other  architectures. 
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DEC  PDP  11/60 


Table  5.64  lists  the  execution  speeds  of  each  of  the 
algorithms  and  sequence  lengths  on  the  PDP  11/60.  Using  a 
clock  resolution  of  16.7  milliseconds  and  a  minimum 
execution  time  of  183  milliseconds,  the  maximum  percentage 
error  is  9.1%.  The  correlation  coefficients  between  the 
execution  speeds  and  four  major  instruction  categories  are: 


floating  multiply/divide 

0.9760 

floating  add/subtract 

0.9846 

integer  operations 

0.9831 

data  transfers 

0.9962. 

Tables  5.65  through  5.82  list  the  instruction  counts  for 
each  category  and  sequence  length.  The  correlation 
coefficients  for  the  PDP  11/60  range  from  0.9670  for  the 
floating  multiplications  and  divisions  to  0.9962  for  the 
data  transfers.  The  execution  speed  increase  expected  from 
the  addition  of  the  floating  point  processor  is  limited  by 
the  fact  that  two  memory  cycles  are  required  to  transfer  one 
operand  (Digital  Equipment  Corporation,  1979).  In 
addition,  computational  overhead  is  required  to  set  up  the 
addressing  for  the  floating  point  operand.  Figure  5.4 
shows  that  the  WFTA2  has  the  largest  percentage  of  execution 
time  taken  by  data  transfers  with  57%,  while  the  MFFT  has 
the  smallest  percentage  with  49%. 
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Unable  to  execute  due  to  Insufficient  memory 
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Total  107976  89208  105120  87885  105560  59424  555173 


Percentage  of  Time  per  Instruction  Category  -  DEC  PDP  11/60 


DEC  PDP  11/50 

Table  5.83  lists  the  execution  speeds  of  each  of  the 
algorithms  and  sequence  lengths  on  the  PDP  11/50.  Using  a 
clock  resolution  of  16.7  milliseconds  and  a  minimum 
execution  time  of  411  milliseconds,  the  maximum  percentage 
error  is  4.1%.  The  correlation  coefficients  between  the 
execution  speeds  and  four  major  instruction  categories  are: 


floating  multiply/divide 

0.9186 

floating  add/subtract 

0.9575 

integer  operations 

0.9750 

data  transfers 

0.9966. 

Tables  5.65  through  5.82,  which  are  the  same  tables  as  those 
for  the  PDP  11/60,  list  the  instruction  counts  for  each 
category  and  sequence  length.  The  values  of  the  correlation 
coefficients  for  the  PDP  11/50  range  from  0.9186  for  the 
floating  multiplications  and  divisions  to  0.9966  for  the 
data  transfers.  Thus  the  execution  speed  is  most  closely 
related  to  the  number  of  data  transfers.  The  reason  for  the 
importance  of  the  data  transfers  can  be  seen  in  the 
instruction  timings.  A  floating  addition  requires  5.66 
microseconds,  and  a  floating  multiply  requires  7.61 
microseconds,  while  a  floating  load  requires  4.20 
microseconds  (Digital  Equipment  Corporation,  1976).  Thus, 
if  every  floating  operation  requires  two  loads,  then  the 
time  taken  by  the  floating  loads  is  more  than  the  time  taken 
by  the  floating  operations.  As  shown  in  Figure  5.5,  the 
radix-2  has  the  greatest  percentage  of  execution  time  taken 


v-no 


Cromemco  Z-2D 

Table  5.84  lists  the  execution  speeds  of  each  of  the 
algorithms  and  sequence  lengths  on  the  Cromemco  Z-2D.  Using 
a  timing  accuracy  of  1  second  and  a  minimum  execution  time 
of  8  seconds,  the  maximum  percentage  error  is  12.5%.  The 
correlation  coefficients  between  the  execution  speeds  and 
four  major  instruction  categories  are: 

floating  multiply/divide  0.9957 

floating  add/subtract  0.9668 

integer  operations  0.9954 

data  transfers  0.9919. 

Tables  5.85  through  5.102  list  the  instruction  counts  for 
each  of  the  algorithms  and  sequence  lengths.  The 

correlation  coefficients  for  the  Cromemco  Z-2D  range  from 
0.9668  for  the  floating  additions  and  subtractions  to  0.9957 
for  the  floating  multiplications  and  divisions.  Thus,  the 
execution  speed  is  most  closely  related  to  the  number  of 
floating  point  multiplications  and  divisions.  This  is 
because  the  floating  multiplications  and  divisions  must  be 
done  with  software  routines,  which  are  slower  than  floating 
point  hardware.  Thus,  as  expected,  the  PFA  is  the  fastest 
since  it  has  the  fewest  floating  point  multiplications, 

while  the  MFFT  is  the  slowest  since  it  has  the  most  floating 

point  multiplications.  Figure  5.6  shows  the  percentage  of 

time  taken  by  each  of  the  instruction  categories.  The 
radix-2  has  the  greatest  percentage  of  time  taken  by 
floating  multiplications  with  63%,  while  the  WFTA2  has  the 
smallest  with  20%.  The  WFTA  could  not  be  executed  due  to 
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Unable  to  execute  due  to  Insufficient  memory. 
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Others  9011  92253  101264 


TABLE  5.87 
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Figure  5.6.  Percentage  of  Time  per  Instruction  Category  -  Cromemco  Z-2D 


insufficient  memory,  but  based  solely  on  the  number  of 
floating  multiplications,  the  time  for  UFTA1  is  expected  to 
be  between  the  PFA  and  radix-2,  while  the  time  for  WFTA2  is 
expected  to  be  less  than  the  time  for  the  PFA.  Since  other 
microprocessors  have  the  same  basic  architecture  as  the 
Cromemco  Z-2D,  the  WFTA2  is  expected  to  be  the  fastest 
algorithm  and  MFFT  is  expected  to  be  the  slowest  algorithm 
on  any  microprocessor.  However,  the  memory  required  for  the 
WFTA  is  at  least  twice  as  large  as  the  PFA.  Thus,  if  the 
microprocessor  is  memory  limited  such  that  WFTA  would  not 
fit,  the  PFA  would  be  the  fastest  algorithm. 


VI.  Comparison  of  Computer  Architectures  and  Algorithms 

As  expected,  the  Cray-1  is  the  fastest  computer,  while 
the  Cromemco  Z-2D  is  the  slowest.  However,  no  one  algorithm 
was  always  the  fastest  or  slowest.  Comparing  the  sequence 
lengths  of  512,  1024,  and  2048  with  504,  1008,  and  2520, 
respectively,  an  ordering  of  the  algorithms  based  on 
execution  speed  is  different  for  each  computer.  The  fastest 
algorithms  on  the  different  computers  were:  UFTA2  on  the 

Cray-1,  radix-2  on  the  Cyber  750,  and  PFA  on  the  DEC  VAX 
11/780,  IBM  370/155,  DEC  PDP  11/60,  DEC  PDP  11/50,  and 
Cromemco  Z-2D.  The  slowest  algorithms  were:  WFTA1  on  the 

Cray-1  and  DEC  VAX  11/780;  MFFT  on  the  IBM  370/155  and 
Cromemco  Z-2D;  and,  depending  on  the  sequence  length,  either 
WFTA1  or  MFFT  on  the  PDP  11/60  and  PDP  11/50.  Executing  the 
algorithms  on  a  faster  computer  did  not  increase  the 
execution  speeds  of  all  the  algorithms  equally.  For 
example,  the  radix-2  algorithm  ran  an  average  2.55  times 
faster  on  the  Cray-1  than  on  the  Cyber  750,  while  the  WFTA2 
ran  an  average  of  5.04  times  faster.  Table  6.1  lists  the 
speed  increases  of  one  computer  over  another.  Figure  6.1 
shows  a  plot  of  the  execution  speed  versus  data  transfers 
for  the  architectures  studied.  Table  6.2  lists  the 
equations  of  the  lines  plotted  in  Figure  6.1.  The  reasons 
for  the  unequal  increases  in  performance  are  related  to  the 
computer  architecture  and  will  be  explored. 

Comparing  the  Cyber  750  and  the  Cray-1,  the  WFTA1  had 
the  greatest  increase  in  execution  speed.  WFTA1  was  an 


VI-1 


TABLE  6.2 


Equations  of  Lines  of  Best  Fit 

Cray-1: 

T  =  0.0335  D  +  2.8409 

CDC  Cyber  750: 

T  =  0.3736  D  -  0.5944 

IBM  370/155: 

T  =  2.6763  D  +  41.2501 

DEC  VAX  11/780: 

T  =  5.1865  D  +  38.7573 

DEC  PDP  11/60: 

T  =  3.6834  D  +  31.4450 

DEC  PDP  11/50: 

T  *  9.8966  D  -  2.0489 

Cromemco  Z-2D: 

T  =  97.3118  D  -  4576.0512 

average  of  5.0  times  faster  on  the  Cray-1,  while  the  radix-2 
was  only  2.6  times  faster.  The  availability  of  vector 
operations  benefited  WFTAl,  with  its  matrix  structure,  more 
than  the  others.  Improvements  in  the  Cyber  750  architecture 
could  be  made  by  decreasing  the  data  transfer  time  through 
the  use  of  high  speed  buffer  storage.  Optimization  of 
algorithms  for  the  Cyber  750  could  be  accomplished  by 
minimizing  the  number  of  data  transfers  from  memory. 

In  comparing  the  Cyber  750  and  IBM  370/155,  the  speed 
increases  of  the  Cyber  over  the  IBM  are  approximately  equal 
except  for  the  radix-2  algorithm,  which  benefited  twice  as 
much  as  the  others.  Since  the  Cyber  data  transfer  time  is 
slower  than  its  floating  operation  time,  whereas  the  IBM 
data  transfer  time  is  faster  than  its  floating  operation 
time,  algorithms  with  fewer  data  transfers  see  more  of  an 
improvement  when  executed  on  the  Cyber.  Thus,  any 
improvement  of  the  IBM  370/155  architecture  for  FFT 
execution  should  be  directed  towards  increasing  the 
throughput  of  floating  operations.  FFT  algorithms  can  be 
optimized  by  decreasing  the  number  of  floating  operations, 
even  at  the  expense  of  increased  data  transfers. 

In  comparing  the  DEC  VAX  11/780  to  the  IBM  370/155,  the 
WFTA  was  fastest  on  the  IBM,  while  the  other  three  were 
faster  on  the  VAX.  The  WFTAl  was  an  average  of  1.1  times 
faster  on  the  IBM,  while  the  WFTA2  times  were  within  51  of 
each  other  for  the  two  computers.  The  PFA  is  1.08  times 
faster,  MFFT  is  1.26  times  faster,  and  radix-2  is  1.12  times 


faster  on  the  VAX  than  the  IBM.  The  instruction  timings  are 
not  available  for  the  VAX.  However,  the  VAX  has  the 
greatest  improvement  over  the  IBM  on  the  MFFT.  Since  the 
MFFT  has  a  greater  number  of  floating  multiplies  than  the 
other  algorithms,  these  results  could  be  explained  by  the 
VAX  having  a  smaller  floating  multiply  to  data  transfer 
ratio  than  the  IBM,  which  has  a  ratio  of  7.4.  Thus  the  VAX 
performs  better  than  the  IBM  on  algorithms  with  more 
floating  point  operations.  However,  performance  is  about 
equal  on  algorithms  with  a  large  number  of  data  transfers. 
This  is  due  to  the  similar  architectures  of  the  computers. 
Both  have  cache  memories  which  improve  the  performance  of 
algorithms  with  many  data  transfers. 

The  speed  increases  of  the  IBM  370/155  over  the  PDP 
11/60  are  the  greatest  in  the  WFTA  and  PFA.  The  WFTA2  is 
1.8  times  faster  and  the  PFA  is  1.7  times  faster,  while  the 
MFFT  is  only  1.3  times  faster.  The  ratio  of  the  floating 
multiply  to  data  transfer  time  is  about  7.4  on  the  IBM 
370/155,  while  on  the  PDP  11/60  it  is  only  2.0.  Thus  the 
IBM  370/155  performs  better  than  the  PDP  11/60  on  algorithms 
which  have  fewer  floating  operations  but  more  data 
transfers.  No  obvious  areas  of  improvement  can  be  found  for 
the  PDP  11/60  architecture.  Likewise,  optimization  of  FFT 
algorithms  for  the  PDP  11/60  is  not  as  simple  as  the  other 
architectures.  Data  transfers  can  be  traded  for  floating 
multiplies  if  eliminating  on  multiply  adds  less  than  two 
data  transfers,  since  the  speed  ratio  between  these  two 
instructions  is  approximately  2.  The  same  reasoning  can  be 
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applied  to  trade-offs  between  data  transfers/floating  adds 
and  floating  adds/floating  multiplies. 

In  comparing  the  PDP  11/60  and  POP  11/50,  WFTA  executes 
2.2  times  faster,  while  MFFT  is  2.7  times  faster.  The  ratio 
of  floating  multiply  speed  to  data  transfer  speed  is  2.7  for 
the  PDP  11/50.  Since  the  speed  increases  are  relatively 
close,  no  relationship  between  the  speed  increases  and  the 
differences  in  the  computer  architectures  can  be  made.  The 
trade-offs  discussed  under  the  PDP  11/60  apply  here  with 
appropriate  modifications  to  the  instruction  speed  ratios. 

The  radix-2  algorithm  shows  the  greatest  increase  in 
execution  speed  when  run  on  the  PDP  11/50  over  the  Cromemco 
Z-2D.  It  increased  28.37  times,  while  the  PFA  increased 
18.43  times.  The  floating  multiply  to  data  transfer  speed 
ratio  is  308.  Thus  a  greater  improvement  occurs  in  the 
algorithms  with  more  floating  operations.  Since  the 
correlation  coefficients  between  the  execution  speed  and  the 
floating  operations  is  the  greatest,  the  Cromemco  Z-2D 
architecture  could  best  be  improved  with  the  addition  of 
floating  multiplication  and  addition  functional  units. 
Optimization  of  the  algorithm  could  be  accomplished  by 
eliminating  a  floating  multiplication  as  long  as  less  than 
308  data  transfers  are  added. 
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VII.  Minimum  Computer  Architecture  for  Efficient 

Performance 

This  section  will  present  the  minimum  computer 
architecture  needed  to  efficiently  execute  FFT  algorithms. 
The  proposed  architecture  is  by  no  means  optimum,  because 
more  features  could  be  added  to  improve  the  algorithm 
performance  if  desired.  The  decision  to  include  a  feature 
into  this  architecture  is  based  on  the  results  from  Chapters 
5  and  6  of  executing  the  given  FFT  algorithms  on  the  seven 
computers.  When  implementing  the  architecture,  at  least  the 
following  three  hardware  features  must  be  considered: 
functional  units,  local  storage,  and  high  speed  buffer 
memory.  But  since  architecture  as  defined  here  includes  the 
compiler  and  operating  system,  the  system  software  must  also 
be  considered.  In  addition,  the  optimization  needed  by  a 
given  architecture  to  improve  its  performance  can  be 
determined.  The  minimum  hardware  for  an  efficient  FFT 
processor  is  shown  in  Figure  7.1. 

Functional  Units. 

A  functional  unit  is  a  hardware  building  block  designed 
to  perform  a  specific  operation.  Separate  pipelined 
functional  units,  as  in  the  CDC  Cyber  750,  decrease  the 
dependence  of  the  execution  speed  on  the  number  of  floating 
operations  by  allowing  certain  instructions  to  execute  in 
parallel.  The  tables  in  Chapter  5  show  that  the  greatest 
number  of  instructions  are  in  the  categories  of  integer 
addition,  floating  addition,  and  floating  multiplication. 
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Therefore,  these  are  the  minimum  functional  units  required. 
Floating  division  always  accounted  for  less  than  0.1%  of  the 
floating  instructions  and  therefore  does  not  warrant  a 
separate  functional  unit.  Each  of  the  functional  units 
should  be  pipelined  and  have  access  to  all  of  the  high  speed 
registers . 

Local  Storage. 

Local  storage  consists  of  the  high  speed  registers 
contained  in  the  CPU.  Local  storage  is  used  by  the 
functional  units  to  store  operands  and  results,  and  also  to 
hold  the  addresses  of  the  operands.  The  floating  point 
registers  and  the  address  registers  need  not  be  the  same 
length.  The  length  of  the  floating  point  registers 
corresponds  to  the  accuracy  of  the  digital  representation, 
while  the  length  of  the  address  registers  corresponds  to  the 
amount  of  memory  used.  Current  architectures  suggest  that  a 
minimum  of  32  bits  be  used  for  the  floating  point  registers 
and  16  bits  for  the  address  registers.  However,  longer 
length  sequences  may  require  more  memory  than  16  bits  can 
address,  thus  the  address  registers  usually  should  be 
longer.  Increasing  the  number  of  high  speed  operand 
registers  available  to  the  functional  units  would  decrease 
the  number  of  data  transfers  required,  and  thus  decrease  the 
execution  speed  of  the  algorithm.  The  paper  by  Nawab  and 
McClellan  suggests  that  the  number  of  registers  necessary 
for  the  efficient  implementation  of  FFT  algorithms  should  be 
at  least  eight  (Nawab  and  McClellan,  1979). 
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High  Speed  Buffer  Memory. 

The  high  speed  buffer  memory  is  between  the  local 
storage  and  main  memory,  both  in  terms  of  size  and  physical 
connections.  All  data  passing  between  main  memory  and  the 
registers  must  pass  through  the  high  speed  buffer,  or  cache, 
memory.  The  data  is  retained  in  the  cache  memory  until  it 
is  replaced  by  other  data.  The  cache  memory  takes  advantage 
of  the  fact  that  a  program  has  a  higher  than  average 
probability  of  referring  to  a  piece  of  data  or  an 
instruction  that  it  has  recently  referenced.  The  use  of 
high  speed  buffer  memory  does  not  affect  the  number  of  data 
transfers,  but  it  does  reduce  the  average  time  to  perform  a 
data  transfer.  Ideally,  the  high  speed  buffer  should  be 
large  enough  to  contain  a  typical  algorithm  loop,  but  with 
some  algorithms  that  is  not  feasible.  In  general,  a  buffer 
memory  of  at  least  4  kwords  is  necessary  to  obtain  a  90%  hit 
ratio  (Baer,  1980). 

System  Software 

In  addition  to  the  hardware  employed,  the  compiler  used 
by  each  machine  must  be  considered.  Compiler  optimization 
will  not  affect  the  number  of  floating  point  operations,  but 
it  will  affect  the  number  of  data  transfers  and  integer 
operations.  In  an  architecture  such  as  this,  which  has 
several  functional  units,  the  compiler  may  reorder  some  of 
the  operations  to  minimize  the  idle  time  of  the  functional 
units.  The  compiler  must  be  matched  to  and  take  optimum 
advantage  of  the  hardware.  Improvements  made  on  the 


VII-4 


compiler  could  affect  the  number  of  data  transfers  and 
consequently  the  results  presented  here.  The  Cray-1  and  VAX 
11/780  used  compilers  that  implemented  the  FORTRAN  77 
language,  while  the  others  used  compilers  that  implemented 
the  older  FORTRAN  IV  language.  The  newer  language  version 
has  some  changes  that  affect  the  performance  of  FFT 
algorithms.  Specifically,  FORTRAN  77  requires  that  all  DO 
loops  be  checked  for  an  initial  value  greater  than  the  final 
value,  and  if  that  condition  is  found,  the  loop  should  not 
be  executed  at  all.  In  the  older  FORTRAN  IV,  a  loop  was 
executed  once  regardless  of  the  initial  and  final  values. 
Thus  the  FORTRAN  77  compiler  inserted  extra  code  into  the 
program  that  was  not  needed  since  the  FFT  programs  had  no 
such  loops.  These  extra  statements  could  only  slow  down  the 
execution  speed  of  the  algorithms.  Thus,  for  FFT 
algorithms,  a  FORTRAN  IV  compiler  will  produce  better  code 
than  an  equivalent  FORTRAN  77  compiler.  However,  some 
FORTRAN  77  compilers,  such  as  the  one  on  the  VAX  11/780, 
have  an  option  to  conform  to  the  older  standard.  This 
option  should  be  used  wherever  possible. 

Optimization  of  Current  Architectures 

For  a  given  currently  available  architecture,  the  areas 
needing  improvement  to  optimize  the  architecture  for  FFT 
execution  can  be  determined  through  the  correlation 
coefficients.  After  each  of  the  FFT  algorithms  has  been  run, 
with  the  execution  speed  measured  and  the  instruction  counts 
determined,  the  correlation  coefficients  between  the 
execution  speed  and  each  major  instruction  category  can  be 


calculated.  Then  the  instruction  category  which  has  the 
largest  correlation  coefficient  can  be  determined.  The  best 
improvement  to  the  architecture  would  be  one  that  reduces 
the  execution  time  of  that  instruction  category.  For 
example,  if  the  data  transfers  had  the  largest  correlation 
coefficient,  then  features  which  improve  the  data  transfer 
rate,  such  as  high  speed  buffer  memory,  should  be  added. 
Alternatively,  this  information  could  be  obtained  by 
counting  the  instructions,  as  before,  and  then  using  the 
individual  instruction  times  to  obtain  a  calculated 
execution  time.  Then  the  correlation  coefficients  or  the 
percentage  of  time  spent  on  each  instruction  category  can  be 
determined.  The  instruction  taking  the  largest  percentage  of 
time  is  the  one  that  needs  improvement.  However,  this 
method  is  not  as  good  as  using  the  correlation  coefficients 
because  it  neglects  the  instruction  overlap.  Architectures 
in  which  a  high  correlation  coefficient  was  measured  between 
the  execution  speed  and  floating  operations,  as  was  done  in 
a  previous  section  for  the  Cromemco  Z-2D,  would  benefit  most 
from  the  addition  of  functional  units.  If  the  correlation 
coefficient  is  the  greatest  between  a  particular  floating 
operation  and  the  execution  speed,  then  that  operation  has 
the  greatest  effect  on  the  execution  speed,  and  improvements 
should  be  made  to  increase  the  speed  of  that  operation, 
which  is  normally  done  through  the  addition  or  improvement 
of  functional  units.  Architectures  with  relatively  slow 
data  transfer  times,  which  is  indicated  by  a  high 
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correlation  coefficient  between  the  execution  speed  and  the 
data  transfers,  would  benefit  most  from  an  increase  in 
registers.  The  architectures  that  would  benefit  most  from 
the  addition  of  a  high  speed  buffer  memory  are  the  ones  with 
a  high  correlation  coefficient  between  the  execution  speed 
and  the  data  transfers,  as  in  the  CDC  Cyber  750. 
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VIII.  Prediction  of  Algorithm  Performance 
Knowledge  of  the  computer  architecture  on  which  an 


algorithm  will  run  can  be  used  to  predict  which  algorithm 
will  have  the  minimum  execution  speed.  To  accomplish  this, 
computer  architectures  can  be  divided  into  three  different 
types:  floating  operation  processors,  data  transfer 
processors,  and  vector  processors. 

Floating  Operation  Processors . 

Floating  operation  processors  are  those  which  execute 
floating  operations  well  and  whose  execution  speed  is 
limited  mainly  by  the  data  transfer  rates  of  the  operands. 
These  architectures  typically  have  a  data  transfer  time 
which  is  greater  than  half  of  the  floating  operation  time 
and  a  high  correlation  coefficient  between  the  execution 
speed  and  number  of  data  transfers.  The  CDC  Cyber  750  is  an 
example  of  a  floating  operation  processor.  These 
architectures  execute  algorithms  with  a  minimum  number  of 
data  transfers  most  efficiently.  Therefore,  ranking  the 
algorithms  according  to  the  number  of  data  transfers,  the 
radix-2  algorithm  would  have  the  minimum  execution  speed, 
followed  by  the  PFA,  WFTA,  and  MFFT.  In  fact,  the  execution 
speed  could  be  roughly  predicted  using  only  knowledge  of  the 
number  of  data  transfers,  independent  of  the  algorithm  used. 
For  example,  the  execution  speed  for  the  algorithms  run  on 
the  CYBER  750  are  plotted  against  the  number  of  data 
transfers  in  Figure  8.1.  The  execution  speed  is  directly 
proportional  to  the  number  of  data  transfers.  The  equation 


for  the  line  of  best  fit  is 

T  -  0.3736  D  -  0.5944  (8-1) 
where  T  is  the  time  in  milliseconds  and  D  is  the  number  of 
data  transfers  in  thousands.  The  average  error  between  the 
points  and  this  line  of  best  fit  is  4.1%.  However,  the 
proportionality  constant  will  vary  with  different  computers, 
but  is  closely  related  to  the  data  transfer  rate  of  the 
computer.  Thus,  for  an  architecture  like  the  Cyber  750,  the 
fastest  algorithm  is  the  one  with  the  fewest  data  transfers 
for  that  particular  sequence  length. 

Data  Transfer  Processors . 

Data  transfer  processors  have  their  execution  speed 
limited  by  the  speed  of  their  floating  point  operations. 
These  processors  have  a  single  multipurpose  functional  unit 
and  usually  a  high  speed  buffer.  The  IBM  370/155  and 
Cromemco  Z-2D  are  examples  of  data  transfer  processors.  The 
correlation  coefficient  between  the  execution  speed  and  the 
floating  operations  is  the  greatest.  Thus,  the  number  of 
data  transfers  does  not  predict  the  execution  speed  as  well 
as  in  the  floating  operations  processors,  as  can  be  seen  in 
Figure  8.2.  The  equation  for  the  line  of  best  fit  is 

T  -  2.6736  D  +  41.2501  (8-2) 
where  T  is  the  time  in  milliseconds  and  D  is  the  number  of 
data  transfers  in  thousands.  The  average  percentage  error 
between  the  actual  execution  times  and  those  predicted  by 
the  line  of  best  fit  is  16.0%,  which  is  about  4  times 
greater  than  the  error  for  the  floating  operations 
processors.  Data  transfer  processors  have  a  data  transfer 
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Figure  8.2.  IB*1  Execution  Sneed  Versus  Data  Transfers 


time  which  is  less  than  half  of  the  floating  operation  time, 
and  execute  algorithms  with  fewest  floating  operations  most 
efficiently.  For  this  architecture,  the  PFA  would  be  the 
fastest  architecture,  followed  by  the  radix-2,  MFFT,  and 
WFTA,  with  the  particular  order  of  the  last  three  depending 
on  the  ratio  of  floating  add  speed  to  floating  multiply 
speed. 

Vector  Processors . 

Vector  processors,  or  array  processors,  have  functional 
units  specifically  for  vector  operations.  The  Cray-1  is  an 
example  of  a  vector  processor.  Vector  processors  execute 
the  initialized  WFTA,  with  its  nested  structure,  most 
efficiently,  followed  by  the  PFA,  radix-2,  and  MFFT,  with 
the  order  of  the  last  three  dependent  on  the  other  features 
available  in  the  processor. 

Now  that  guidelines  for  predicting  algorithm 
performance,  based  on  the  knowledge  of  the  computer 
architecture,  have  been  proposed,  the  next  chapter  will 
summarize  the  results  and  conclusions  of  this  study. 
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Conclusions 

This  study  presented  an  evaluation  of  the  four  major 
FFT  algorithms  on  seven  different  computers.  The  data 
clearly  shows  that  no  one  algorithm  is  faster  than  another 
in  all  cases.  The  reasons  why  certain  algorithms  perform 
better  on  certain  computers  were  explained  in  terms  of  the 
computer  architecture.  The  execution  speeds  were  related  to 
four  different  instruction  categories:  floating 

add/subtract,  floating  multiply/divide,  integer  operations, 
and  data  transfers.  The  average  correlation  coefficients 
were  0.9518,  0.8614,  0.9401,  and  0.9792,  respectively.  In 

all  cases,  the  number  of  data  transfers  was  highly 
correlated  with  the  execution  speeds.  For  floating 
operation  processors,  the  number  of  data  transfers  was  a 
better  predictor  of  the  algorithm  performance,  with  a  4. IX 
error,  than  the  number  of  floating  operations.  In  computer 
architectures  with  several  pipelined  functional  units,  the 
number  of  data  transfers  had  a  major  impact  on  the  execution 
speed,  and  the  radix-2  was  the  fastest  algorithm.  In 
computer  architectures  with  a  single  multipurpose  functional 
unit,  such  as  microcomputers,  floating  operations  had  a 
major  impact  on  execution  speed,  but  data  transfers  were 
still  important.  These  architectures  executed  the  PFA 
fastest.  Architectures  with  vector  operations  executed  the 
initialized  WFTA  fastest. 

The  minimum  computer  architecture  for  the  time 
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efficient  execution  of  FFT  algorithms  was  presented.  This 
architecture  included  several  functional  units  and  a  high 
speed  buffer  memory.  A  qualitative  method  was  given  for 
predicting  the  performance  of  FFT  algorithms  based  on  the 
computer  architecture.  This  information  can  be  used  to 
determine  which  algorithm  should  be  used  for  a  particular 
computer,  and  also  to  determine  the  computer  architecture 
for  dedicated  FFT  processors.  The  correlation  coefficients 
between  the  different  instructions  and  the  execution  speeds 
can  be  used  to  determine  which  areas  of  the  computer 
architecture  need  improvement  in  order  to  optimize  FFT 
algorithm  execution. 

Recommendations 

To  complete  this  study,  the  instruction  timings  of  the 
VAX  11/780  need  to  be  determined.  These  instruction  timings 
can  then  be  used  to  determine  the  percentage  of  time  spent 
on  each  instruction  category. 

The  effects  of  the  FORTRAN  compilers  on  the  execution 
speeds  need  to  be  studied  in  more  detail.  The 
characteristics  of  a  good  FFT  compiler,  and  operating 
system,  need  to  be  determined,  as  well  as  the  optimizations 
that  should  be  performed  to  improve  FFT  execution  speed. 
The  FORTRAN  code  should  be  compared  to  the  same  algortihm 
written  in  assembly  language. 

To  properly  study  the  effects  of  the  computer 
architecture  on  FFT  algorithm  performance,  simulation 
programs  are  needed  for  each  architecture.  However,  no  such 
programs  were  available  at  the  time  of  this  study.  In 
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addition,  a  simulation  program  of  the  proposed  minimum 
architecture  should  be  developed.  The  results  of  the 
simulation  program  could  be  used  to  make  improvements  to  the 
proposed  minimum  architecture.  These  simulation  programs 
should  account  for  instruction  overlaps  and  cache  memory 
hits. 

A  quantitative  method  for  predicting  performance  needs 
to  be  developed.  This  model  should  be  based  on  the 
architectural  features  and  the  algorithm  and  should  result 
in  a  predicted  execution  time. 

The  analysis  presented  in  this  study  needs  to  be 
expanded  to  include  dedicated  FFT  processors  and  array 
processors.  The  results  could  be  used  to  improve  the  design 
and  thus  the  performance  of  dedicated  FFT  processors. 
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APPENDIX  A 


Radix-2  Algorithm  Program 

This  appendix  contains  an  in-place  FFT  program  used  in 
executing  and  timing  the  radix-2  algorithm.  The  program  was 
originally  developed  by  Rabiner  and  Gold  (Rabiner  and  Gold, 
1975)  and  was  modified  by  Route  to  eliminate  the  calls  to 
the  complex  library  function  (Route,  1981).  The  program 
consists  of  a  driver  and  a  single  subroutine.  The 
subroutine  is  called  using  the  format 

CALL  FFT2CM( A , B , M , XIN , XOUT ) 
where  the  variables  are  defined  as  follows. 

A  -A  is  a  real  array  specifying  the  real  components  of 

the  input  sequence  x(n)  and  the  output  sequence  X(k). 

B  -B  is  a  real  array  specifying  the  imaginary  components 
of  the  input  sequence  x(n)  and  the  output  sequence 
X(k). 

M  -M  is  an  integer  specifying  the  power  to  which  two  is 

raised  to  obtain  the  sequence  length. 

XIN  -XIN  is  a  real  value  set  to  the  starting  value  of  the 
realtime  clock. 

XOUT  -XOUT  is  a  real  value  set  to  the  difference  between  the 
final  value  of  the  realtime  clock  and  XIN. 


A-l 


c 


c 

C  PROGRAM  RADIX  2  FFT  DRIVER 
C 

C  THIS  I£  ft  DRIVER  PROGRAM  FOR  THE  SINGLETON  FAST  FOURIER  TRANSFERS 
C  ROUTINE.  IT  L'-i5  ORIGINALLY  URITTEN  TO  BE  RUN  OH  THE  PD?  11/56 
C  OF  AFJAL/FIGX.  IT  IE  URITTEN  IN  FORTRAN  AND  IS  TO  BE  COMPILED 
C  UITH  THE  DEC  F4P  COIFILER 


C  AUTHOR:  MARK  A.  MEHhLIC 

C  DATE:  21  APR  E3 

C  VERSION:  1.6 

C  SUBROUTINES  CALLED:  FFT2CM, GRAPH 

C 

C  HoMotamtSi-ifc:  3W»3>3f3|3|gB>3KJry-’<.-»3fri  'V-isia 

r 

C 

C *********** »  tfafcMoiai •»:**> 


t_ 

C  DECLARE  ALL  VAT.  IAELES  AND  ARRAYS  USED 
C 

CiMot^oBWK)KV^>k*>^^>K5K)t:^;tTf  >K>'>K>K>t':>Mr^)K^*)K*3*OMO^*3K^rS'H:)k)MoW*)totoMo4<)toloK)tc+:** : 

C 

REAL  A(512),B(512) 

PEAL  MAG (512) 

C 

C#*^>K*TK*Mt:*)»nW!K>M:Hasrasa<**)tstnf*3W<*#3Wc#oloWoK*>^ 

C 

C  SET  UP  THE  CONSTANTS  USED  IN  THE  PROGRAM 
C 

c 

N-51? 

M  -  5 

PI-3. I41552SE355 
Tt-e.ci 
E  *  2. 7 1828 
T-e.c 

c 

c 

C  DIGITIZE  THE  CONTINUOUS  FUNCTION 
C 

c  »T*3(o!c**)W(0l«3*ai  Si  >:sK»3K»s(>3f.Tt3la|oio»s*3!  '^>Tto)nMs»s)n»TMo>o».si 
C 

DO  10  I-l.N 

ACI)-(E**C-T))  *  C0S(5B*PI»T) 

BCI)-6. 6 
T-T+Tl 
10  CONTINUE 

CALL  GRAPH (A, N) 

C 
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C  CALL  THE  SUE TOUT IKE  FFT2CM  (MODIFIED  RADIX  2  ALGORITHM) 

C 

c  **#**' •****#**>:  -^H:***:*-  :^TK>WoK*H“40MorW^)K>WKNoto(~T: 

C 

CA.L  FFT2I  *  A.E*! :.•  XIH.XOJT) 

C 

C  *****'>;  ii.^Uf.XaMS  :4:-i  :»;r;:>KiMc#**>io|a|c^jWaK**#>io;aK*****:>K#^ 

C 

C  PRINT  OUT  THE  EXECUTION  TIKE 
C 

c 

PRINT*.  'THE  EXECUTION  TIKE  IE'.XOUT 

C 

C 

C  DETERMINE  THE  MACK ITUEE  OF  THE  FOURIER  TRANSFORM 
C 

c  ***•**'*•>»;:»:;! 3fc***.i*3i :***»aww»3-  :*>ioralaHM^)lak#*)laiat3^>iaki|aMc*****M**>K*)k**:>K*>t: 

C 

DO  20  I-UK 

MAG ( I)  •  ( (A ( I ) ) **2+(BC I) ) **2)**8 . 5 
PRINT*.  All).  BCD.  MAG  ( I) 

20  CONTINUE 
r 


C  CALL  GRAPH  UITH  THE  MAGNITUDE  ARRAY  AND  NUKBER  DF  POINTS 
C 


n  n  o  n  r>  r>  o  o  n  o  it  o  o  o  o  <">  o  <  i  r»  o  o  r> 


*:*.■*  *  »KC*  ^ok*^**-*'***'*-'  Tkikaalc*-*1*  *  iwno-^owieur »  y-'\-<otQ».DL)>  <ntnto».»».>-yotoMntn<Q>.T>y  t******** 


SUBROUTINE  FFT  RAC  IX- 2  FDR  PDF  1 1/EC 


AUTHOR : 
CODED  BY: 
DATE : 

VEF. z ; OK : 


GA"-.'  P.  ROUTE 
MAC  ft.  MEHALIC 
21  AFF.  83 
i .  C 


THIS  SUBPOLTI'C  CPLCULftTES  THE  FAST  FOURIER  TRANSFORM  OF  A  SEQUENCI 
UNOSE  LENGTH  IS  ft  MULTIPLE  OF  2.  IT  HAS  BEEN  SPECIFICALLY 
MCI  IF  I E I’  FOR  THE  PSP  11  SERIES  OF  COMPUTERS  SO  THAT  THE  CALL 
TO  THE  COMPLEX  FUNCTIONS  OF  THE  LIBRARY  HAVE  BEEN  ELIMINATED. 
AS  A  RESULT.  IT  L'iLL  RUN  FASTER  THAN  THE  FFT2C  ALGORITHM. 


SUBROUTINE  FFT2CK C  A .  E .  M.  X!  N  > XOUT)  | 


DIM^NSIOh  THE  INPUT  (AND  OUTPUT)  ARRAYS  FOR  THE  SEGUENCE  LENGTH 


C  *:***: 
C 


5 

6 


? 


X:)totQ»o4oioto*otQt( 


DIMENSION  A (512).  BCE  12) 
XIK*£ECKDS  f 0 . G) 

NV2*N/2 
r!MlEH-  I 
>1 

r:  ?  i  :  .nmi 
r-  (I.GE.J)  GO  TO  5 
rr  =  AO 

TT  -  p  /  T> 

A  C  J)  -  AC  I) 

BCJ)  *  ECU 
AC  I)  -  TR 
BCD  -  TI 
K  ■  NV2 

IF  (K.GE.J)  GO  TO  ? 

J  -  J-K 
K  -  Y.s  2 
GO  TC  6 
J  «  J+L* 

PI  •  3. 1415S2S535B9B 
DO  2P  L*  DM 
LE  •  2**L 
LEI  -  LE/2 

ur  -  1.0 

UI  -  0.0 

UR  •  COSCPi/LEl) 


A-4 


nnnnn  n n n r> n 


u:  «  SIHlP ' /_E I) 

r:  20 


I'D 

iC-  i-j.n.le 

IP  - 

I+LE1 

TR  = 

ftCr5**JR-B< 

IP)*UI 

TI  - 

AC  Irl'iCI+BC 

if;*ur 

ft  £  IF  3 

-  All) -TR 

EC  IF) 

•  E  £  1 3  —  T I 

AC!) 

=  A  I • +TR 

10 

Eli) 

«  El !)+TR 

TR 

=  ur.' 

".F-U  I 

UI 

*  U". 

■.  j:+ui*ur 

23 

UR 

=  TR 

c 

C*T(Otofc*OK>KJtOfc^!<OKWSoi3K>l^'".  .Tr:>ntaMOIOkir3»8lcM:i|aK)ftataMa|0|aiOlO|nt3jo^>K)fc>!nKValaMc>k>Moiai'Ntf 
CALCULATE  THE  EXECUTION  TIME 
sMoWKiioWloMoKttJMoKiMotaiotoMoraotaiQfcaJtataoMMoloWloKiMloloMatoMoM^ 

XOUT  -  SECMDS(XIN) 


END  OF  SUEr  CL'TIKF 


RETURN 

END 
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APPENDIX  B 


Singleton’s  Algorithm  Program 
This  appendix  contains  the  driver  and  subroutine 
programs  used  to  execute  and  time  the  Singleton  mixed  radix 
algorithm.  The  algorithm  is  divided  into  two  subroutines: 
FFTSNG  and  FFTMX.  FFTSNG  does  the  sequence  length 
factorization  and  calls  FFTMX,  which  does  the  short 
transforms.  The  subroutines  are  called  using  the  format 
CALL  FFTSNG ( A , B , NSEG , N , N SPN , I SN , AT , CK , BT , SK , KD , NP , NPM , 

XIN,XOUT) 

and 

CALL  FFTMX( A , B , NTOT , NF , NSPAN , ISN , M , KT , AT , CK , BT , SK , KD , NP , 

NPM.NFAC) 

where  the  variables  are  defined  as  follows. 

N  -N  is  the  integer  sequence  length. 

A  -A  is  a  real  array  of  dimension  N.  When  FFTSNG  is 
called,  A  contains  the  real  components  of  the  input 
sequence  x(n).  On  return  from  FFTNSG,  A  contains  the 
real  components  of  the  output  sequence  X(k). 

B  -B  is  a  real  array  of  dimension  N.  When  FFTSNG  is 
called,  B  contains  the  imaginary  components  of  the 
sequence  x(n).  On  return  from  FFTSNG,  B  contains  the 
imaginary  part  of  the  output  sequence  X(k). 

NSEG  -NSEG  is  an  integer  equal  to  the  total  number  of 
complex  data  values  divided  by  (N*NSPN).  Usually, 
NSEG-1 . 

NSPN  -NSPN  is  an  integer  specifying  the  spacing  of 
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consecutive  data  values.  Usually,  NSPN-1. 

ISN  -ISN  is  an  integer  specifying  the  direction  of  the  FFT. 
If  ISN  is  negative,  a  forward  transform  is  performed. 
If  ISN  is  positive,  an  inverse  transform  is  performed 
scaled  by  1/N. 

AT,BT  -AT  and  BT  are  real  arrays  dimensioned  as  the  maximum 
of  1.)  the  largest  odd  prime  factor,  or  2.)  the 
product  of  the  square  factors  of  N.  They  are  used  for 
temporary  storage  during  the  transform  of  an  odd  prime 
factor  greater  than  5  and  during  the  permutation  of  the 
square  free  factors. 

CK,SK  -CK  and  SK  are  real  arrays  dimensioned  at  least  equal 
to  the  largest  odd  prime  factor.  They  are  used  to 

provide  temporary  storage  during  the  transform  of  an 
odd  prime  factor  greater  than  5. 

KD  -KD  is  an  integer  specifying  the  number  of  square 
factors  in  N. 

NP  -NP  is  an  integer  array  of  dimension  at  least  NPM.  It 
is  used  to  store  the  digit  reversed  order  of  the  square 
free  factors. 

NPM  -NPM  is  an  integer  value  equal  to  the  product  of  the 
square  free  factors  of  N. 

XIN  -XIN  is  a  real  value  used  to  hold  the  initial  value  of 
the  real  time  clock. 

XOUT  -XOUT  is  a  real  value  containing  the  amount  of  time 
required  for  the  subroutine  to  execute. 
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DIMENSION  fl(2520) ,G (2520) 

DIMENSION  XftVG (4) 

DIMENSION  ftT(7).CK(7),BT(?;*,SK(?).NP(?0) 

NSEG  -  1 
NSPN  -  1 
ISN  -  -1 
KD  -  7 
NPM  -  70 

READ(5.*)  NUMTIM 
DO  9S9  INUM-l, NUMTIM 
READ(5.*)  T.DELT.LIST 
C  URITE(6.601>  T, DELT, L 1ST 

601  FORMAT! 1H1. 'TEST  WITH  T  -'.F10.2,'  DELT  -  Fie. 2. 'LIST  OPT-', 13) 

LCOUI1T  -  0 
111  CONTINUE 

L COUNT  -  LCOUNT  +  1 
NPTS  -  (T/DELT)  +0.4 
DELF  -  1.0/T 
PI  -  3. 141592C53C98 
E  -  2.71B2E 
FO  -  25.0 
DO  10  I -1,  NPTS 
Ti  -  ( 1  —  15 » DELT 
FEE  -  2.0>:?I*rC*Tl 
TE  -  -l.e«Tl 
Ed)  *6.0 

10  fid)  -  (E**TE)*COS(ftRG) 

CAUL  FFTSNG ( A .  B . NSEG , NPTS , NSPN , ISN . fiT. CK , BT. SK , KD , NP , NPM, XIN . XOUT) 
PRINT#, 'XOUT'  ,XOUT 
XAVG (LCOUNT)  -  XOUT 
DO  3C  I- 1. NPTS 

30  flCI)  *  ((A(I)**2)  +  (B(l)**2))  **  0.5 

PRINT#,  NPTS 

IF (LCOUmT.LT. 3)  GO  TO  111 
TAVG  -  0. 

DO  599  J-l.3 

TAVG  -  TAVG  +  XAVGCJ) 

599  CONTINUE 
TAVG  •  TftVG/3 

PRINT*.  'flVG  XOUT  -  '.TflVG 
IFCLIST.NE. 1)  GO  TO  999 
Lf?ITEf6.6C0) 

600  FORMAT ( '  1'  ,20X, '  MAG' ,  16X, '  IMflG' ,  16X,  16X./) 

UR ITE (6. 700)  (I, fl(I), B(I). I- 1. NPTS) 

700  FORMAT (5X.  15.5X.E14.7.5X.E14.7.5X) 

999  CONTINUE 
STOP 
END 
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c 

c* 

c 


C  SUBROUTINE  SINGLETON  MIXED  RADIX  FFT  ALGORITHM 
C 

C  THIS  SUBROUTINE  IS  TAKEN  FROM  'PROGRAMS  FOR  DIGITAL  SIGNAL 
C  PROCESSING'  PUBLISHED  BY  THE  IEEE. 

C 

C 

r 


c 

C  THE  SUBROUTINE  STATEMENT  UAS  MODIFIED  TD  INCLUDE  THE  INPUT  AND 
C  OUTPUT  TIMES  Cl!  21  AFP.  S3.  IN  ADDITION,  THE  CALL  TO  THE 

C  SECONDS  SUBROUTINE  TO  CALCULATE  EXECUTION  TIME  UTS  ADDED. 

C 

C 

SUBROUTINE  FFTSNG (A- B.N£EG,N, NSPN. ISN, AT. CK. BT, SK. KD, NP. NPM. 
1XIN.XOUT) 

C 

C 

C  THE  COMMENTS  PRECEDED  BY  CC  ARE  THE 

C  ORIGINAL  CARDS  .  THE  CHANGES  WERE  MADE 

C  BECAUSE  THIS  FORTRAN  COMPILER  CANNOT 

C  ALLOCATE  AND  DE-ALLOCATE  MEMORY 

C 
C 

c 

c . . - - - 

C  SUBROUTINE:  FFT 

C  MULTIVARIATE  COMPLEX  FOURIER  TRANSFORM,  COMPUTED  IN  PLACE 
C  USING  MIXED-RADIX  FAST  FOURIER  TRANSFORM  ALGORITHM. 

C . — - - - - 

c 

c 

C  ARRAYS  A  AND  B  ORIGINALLY  HOLD  THE  REAL  AND  IMAGINARY 
C  COMPONENTS  OF  THE  DATA,  AND  RETURN  THE  REA-  AND 

C  IMAGINARY  COMPONENTS  OF  THE  RESULTING  FOURIER  COEFFICIENTS. 

C  MULTIVARIATE  DATA  IS  INDEXED  ACCORDING  TO  THE  FORTRAN 
C  ARRAY  ELEMENT  SUCCESSOR  FUNCTION,  WITHOUT  LIMIT 

C  ON  THE  NUMBER  OF  IMPLIED  MULTIPLE  SUBSCRIPTS. 

C  THE  SUBROUTINE  IS  CALLED  ONCE  FOR  EACH  VARIATE. 

C  THE  CALLS  FOR  A  MULTIVARIATE  TRANSFORM  MAY  BE  IN  ANY  ORDER. 

C 

C  N  IS  THE  DIMENSION  OF  THE  CURRENT  VARIABLE. 

C  NSPN  IS  THE  SFACING  OF  CONSECUTIS-^  DATA  VALUES 
C  WHILE  INDEXING  THE  CURRENT  VARIABLE. 

C  NSEG#N*NSPN  IS  THE  TOTAL  NUMBER  OF  COMPLEX  DATA  VALUES. 

C  THE  SIGH  OF  ISN  DETERMINES  THE  SIGN  OF  THE  COMPLEX 
C  EXPONENTIAL.  AND  THE  MAGNITUDE  OF  ISN  IS  NORMALLY  ONE. 

C  THE  MAGNITUDE  OF  ISN  DETERMINES  THE  INDEXING  INCREMENT  FOR  A&B. 
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IF  FFT  IS  CALLED  TWICE.  WITH  OPPOSITE  SIGHS  ON  ISM.  AN 

IDENTITY  TRANSFORMATION  IS  DONE... CALLS  CAN  BE  IN  EITHER  ORDER. 
THE  RESULTS  ARE  SCALED  BY  1/N  UHEN  THE  SIGN  OF  ISN  IS  POSITIVE. 

*  TRI-VARIATE  TRANSFORM  UITH  ACN1.N2.N3),  B(N1,N2.N2) 

IS  COMPUTED  BY 

CALL  FFT(A,B.N2*N3,Nl. l.-l) 

CALL  FFT(A,B,N3.N2,ND-D 
CALL  FFTCA.B,  l.N3,Nl*N2,-D 

A  SINGLE-VARIATE  TRANSFORM  OF  N  COMPLEX  DATA  VALUES  IS  COMPUTED  BY 
CALL  FFTCA.B,  l.N.  U-l) 

THE  DATA  MAY  ALTERNATIVELY  EE  STORED  IN  A  SINGLE  COMPLEX 
ARRAY  A.  THEN  THE  MAGNITUDE  OF  ISM  CHANGED  TO  TUO  TO 
GIVE  THE  CORRECT  INDEXING  INCREMENT  AND  A (2)  USED  TO 
PASS  THE  INITIAL  ADDRESS  FOR  THE  SEQUENCE  OF  IMAGINARY 
VALUES,  E.G. 

CALL  FFT (A, A (2) . MSEG.N.NSPN.-2) 


ARRAY  KFAC  IS  DORKING  STORAGE  FDR  FACTORING  N. 
NUi  HER  EXCEED liiG  THE  15  LOCATIONS  PROVIDED 


THE  SMALLEST 
IS  12.754.5S4. 


DIMENSION  A  CH) . E (K) .NFRC ( 15) . P.TCKD)  ,BT(KD)  .CK  CKD) . SIC(KD)  ,NP (NPM) 


COMMON  /CSTAK/  DSTAK C25GD) 

DOUBLE  PRECISION  DSTAK 
INTEGER  ISTAKC50GC) 

REAL  RSTAK (5BGQ) 

EQUIVALENCE  (DSTAK(I). ISTAK(l)) 

EQUIVALENCE  (DSTAK ( 1 ) . RSTAK  CD) 

DETERMINE  THE  FACTORS  OF  N 

XIN-SECNDSCB.0) 

M  -  0 

NF  -  IAESCN) 

K  -  NF 

IF  (NF.EC.D  RETURN 
NSPAN  -  IABS (NF*NSPN) 

NTOT  «  IABS(NSPAN*NSEG) 

IF  C ISN*NTOT. NE . 0)  GO  TO  20 
C  IERR  -  I1MACHC4) 

C  URITE  (IERR. 9999)  NSEG.  N,  NSPN.  ISN 
C999  FORMAT  (31H  ERROR  -  ZERO  IN  FFT  PARAMETERS.  4110) 
PRINT*.' ISN*TOT.NE.0' 

RETURN 

C 

10  M  ■  M  +  1 
NFAC(M)  -  4 
K  -  K/1G 

20  IF  CK-(K/16)*16.EQ.0)  GO  TO  10 


J  -  3 
JJ  «  9 
GO  TO  40 
33  M  -  M  +  1 
NFACCM)  -  J 
K  -  K/JJ 

43  IF  (MOD(K,JJ>.EG.0)  GO  TO  30 
J  -  J  +  2 
JJ  -  J**2 

IF  (JJ.LE.K)  GO  TO  40 
IF  CK.GT.4)  GO  TO  50 
KT  «  M 

NFACCM+1)  -  K 
IF  (K.NE.  1)  M  -  M  +  1 
GO  TO  90 

50  IF  (K- (K/4) *4. KE . 0)  GO  TO  60 
M  •  M  +  1 
KFACCM)  -  2 
K  -  K/4 

C  ALL  SGUftRE  FACTORS  OUT  NOU.  EUT  K  .GE.  5  STILL 
60  KT  •  M 

MAXP  «  MAX0CKT+KT+2.K-1) 

J  -  2 

?G  IF  (MODCK.J) .NE.0)  GO  TO  8S 
M  -  M  +  1 
NFACCM)  -  J 
K  -  K/J 

80  J  -  C(J+n/2)*2  +  1 
IF  (J.LE.K)  GO  TO  70 
9S  IF  (M.LE.KT+1)  MAXP  -  (1  +  KT  +  1 
IF  (M+KT.GT. 15)  GO  TO  123 
IF  (KT.EQ.0)  GO  TO  110 
J  -  KT 

100  M  •  M  +  1 

NFACCM)  -  NFACCJ) 

J  -  J  -  1 

IF  CJ.NE.0l  GO  TO  100 
C 

110  MAXF  •  M  -  KT 

MAXF  -  NFACCMAXF) 

IF  CKT.GT.0)  MAXF  -  MAX0CNFRCCKT) .MAXF) 

CC  J  •  ISTKGTCMAXF*4.3) 

CC  JJ  -  J  +  MAXF 

CC  J2  -  JJ  +  MAXF 

CC  J3  ■  J2  +  MAXF 

CC  K  -  ISTKGT CMAXP,  2) 

CC  K  •  ISTKGTCMAXP.2) 

CC  CALL  FFTMXCA.  B.  NTOT,  NF.  NSPAN.  ISN.  M.  KT.  RSTAKCJ). 

CC  *  RSTAKCJJ).  RSTAKCJ2).  RSTAKCJ3).  ISTAKCK).  NFAC) 

CALL  FFTMXCA. B. NTOT. NF. NSPAN. ISN.H.KT, AT.CK.BT.SK.KD.NP.NPM.NFAC) 
CC  CALL  ISTKRL (2) 

C . . 

c 


c  C3DE  ADDED  TO  CALCULATE  EXECUTION  TIME  —  21  APR  83 
C 

C 

XOUT-SECNDS(XIH) 

RETURN 

C 

C 120  IERR  -  I 1MACH (4) 

C  URITE  (1ERR.9S98)  N 

CS98  FORMAT  (50H  ERROR  -  FFT  PARAMETER  N  HAS  MORE  THAN  15  FACTORS-. 
120  PRINT*.'N  HAS  MORE  THAU  15  FACTORS' 

RETURN 

END 

C 

SUBROUTINE  FFTMXCA.B.NTOT.N.N3PAN  . ISN.M.KT. AT.CK.BT.SK. 

ci:d.np.npm.nfac> 

f*  ^ 

C  SUBROUTINE:  FFTMX 

C  CALLED  BY  SUBROUTINE  'FFT'  TO  COMPUTE  MIXED-RADIX  FOURIER  TRANSFORM 
C - 

c 

CC  SUBROUTINE  FFTMXiA.B.NTOT. NF.NSPAK. ISN.M.KT.NFAC) 

C 

DIMENSION  A(N) .E(N) . AT(KD) .ET(KD)  .CK(KD)  .SK(KD) .NP(NPM) 
DIMENSION  NFACCISj 

C 

INC  *  IAESCISN) 

NT  -  inc*i;tgt 

KS  -  INC*NSPAH 
RAD  -  ATANC1.8) 

S72  «  RAD.'0.e25 
C72  «  COS (S72) 

S72  -  SINIS72) 

S 120  -  SCF:T (0.75) 

IF  (ISN.GT.0)  GO  TO  10 
S72  -  -S72 
S120  -  -S 123 
RAD  -  -RAD 
GO  TO  30 
C 

C  SCALE  BY  1/H  FOR  ISN  .GT.  0 
C 

10  AK  -  I.0/FLOATCN) 

DO  28  J-l.NT, INC 
A(J)  «  A(J)*AK 
B( J)  ■  B(J)*AK 
20  CONTINUE 
C 

30  KSPAN  •  KS 
NN  -  NT  -  INC 
JC  -  KS/N 
C 

C  SIN.  COS  VALUES  ARE  RE- INITIALIZED  EACH  LIM  STEPS 
C 


B-7 


ooooooon  non  non 


LIM  -  32 
KLIM  •  LIM*JC 

I  -  e 

JF  -  0 

MAXF  ■  M  -  KT 
MAXF  -  NrAC(MAXF) 

IF  (KT.GT.G)  MAXF  «  MAXO(NFAC(KT) .MAXF) 

COMPUTE  FOURIER  TRANSFORM 

40  DR  -  B.0*FLOAT(JC)/FLCAT(KSPAN) 

CD  -  2.0*SIN(0.5*DR*RAD)*si2 
SD  -  SIN(DR*RAD) 

KK  -  l 
I  -  I  +  1 

IF  (NFAC(I) .NE.2)  GO  TO  110 

TRANSFORM  FOR  FACTOR  OF  2  (INCLUDING  ROTATION  FACTOR) 

KSPAN  -  KSPAN/2 
K 1  *  KSPAN  +  2 
50  K2  -  KK  +  KSPAN 
P.K  *  A (K2) 

BK  -  B(K2) 

A (K2)  -  A (KK)  -  AK 

E1K2)  *  B (KK)  -  BK 

A(KK)  «  A(KK)  +  AK 

B  (KK)  -  B(KIC)  +  BK 

KK  -  K2  +  KSPAN 

IF  (KK.LE.NN)  GO  TO  50 
KK  -  KK  -  NN 
IF  (KK.LE.JC)  GO  TO  50 
IF  (KK.GT. KSPAN)  GO  TO  350 
60  Cl  -  1.0  -  CD 

si  -  sr 

MM  -  MIN0(K 1/2. KLIM) 

GO  TO  80 

70  AK  -  Cl  -  (CD*€1+SD*S1) 

SI  -  (SD*C1-CDXS1)  +  Si 

THE  FOLLOWING  THREE  STATEMENTS  COMPENSATE  FOR  TRUNCATION 
ERROR.  IF  ROUNDED  ARITHMETIC  IS  USED.  SUBSTITUTE 
Cl-AK 

Cl  -  0.5/(AK**2+Sl**2)  +  0.5 
SI  -  Ct*Sl 

Cl  -  C l*AK 

Cl-AK 

80  K2  •  KK  +  KSPAN 
AK  -  A(KK)  -  A(K2) 

BK  -  B(KK)  -  BCK2) 

A(KK)  •  A(KK)  +  A(K2) 

B(KK)  -  B(KK)  +  B(K2) 

A(K2)  -  Cl*AK  -  S 1 #8K 
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B(K2)  ■  S 1*AK 

+  1 

C1*3K 

KK 

-  K2  +  KSPAN 

IF 

(KK.LT.NT) 

GO 

TO 

B0 

K2 

•  KK  -  NT 

Cl 

-  -Cl 

KK 

-  K1  -  K2 

IF 

(KK.GT.K2) 

GO 

TO 

B0 

KK 

-  KK  +  JC 

IF 

(KK.LE. MM) 

GO 

TO 

70 

IF 

(KK.LT.K2) 

GO 

TO 

90 

K 1 

«  K1  +  INC 

+ 

INC 

KK 

-  (Kl-KSPAN)/2  + 

JC 

IF 

(KK.LE. JC+JC) 

GO 

TO  GO 

GO 

TO  40 

90  SI  -  FLOAT ( (KK- ! ) /JC ) *DR*RAD 
Cl  «  COS(Sl) 

SI  -  SIN(Sl) 

MM  -  MIN0(K1/2,MM+KLIM) 

GO  TO  80 

TRANSFORM  FOR  FACTOR  OF  3  (OPTIONAL  CODE) 

100  K1  *=  Kl<  +  KSPAN 
K2  -  !<1  +  KSPAN 
AK  -  A (KK) 

BK  -  B(KK) 

AJ  -  A (K 1 )  +  A(K2) 

BJ  -  B ( K 1 )  +  B(K2) 

A (KK)  -  AK  +  AJ 
B(KK)  -  BK  +  BJ 
AK  -  -0.5X-AJ  +  AK 
BK  -  -0.5»EJ  +  BK 
AJ  -  (A(KP-A(K2))*S120 
BJ  -  (B(KP-BCK2) XS12G 
A(K1)  *  AK  -  EJ 
B(K1)  -  BK  +  AJ 
A(K2)  -  AK  +  BJ 
B(K2)  ■  BK  -  AJ 
KK  -  K2  +  KSPAN 
IF  (KK.LT.NN)  GO  TO  100 
KK  -  KK  -  NN 

IF  (KK.LE. KSPAN)  GO  TO  100 
GO  TO  230 

TRANSFORM  FOR  FACTOR  OF  4 

110  IF  (NFAC( I) .NE. 4)  GO  TO  230 
KSPHN  -  KSPAN 
KSPAN  ■  KSPAN/4 
120  Cl  -  1.0 
SI  "0 

MM  -  MIN0 (KSPAN, KLIM) 

GO  TO  150 

130  C2  -  Cl  -  (CD*Cl+SD*5l) 


oooonnnn 


51  -  (SD*C1-CD*'SI)  +  SI 

THE  FOLLOWING  THREE  STATEMENTS  COMPENSATE  FOR  TRUNCATION 
ERROR.  IF  ROUNDED  ARITHMETIC  IS  USED.  SUBSTITUTE 
C1-C2 

Cl  -  0.5/'(C2**2+Sl**2)  +  0.5 

Sl«Cl*Si 

Cl  -  Cl*C2 

C1-C2 

140  C2  -  Cl**2  -  S 1**2 

52  -  C 1 1*2.0 

C3  -  C2*C1  -  S2*S1 

53  -  C2*S1  +  S2*C 1 
150  K1  «  KK  +  KSPAN 

K2  -  K1  +  KSPAN 
K3  -  K2  +  KSPAN 
AKP  -  ACKK)  +  A  CK2) 

AKM  *  ACKK)  -  ACK2) 

AJP  -  A(Kl)  +  ACK3) 

AJI-1  -  A  ( K 1 )  -  ACK3) 

ACKK)  «  AKP  +  AJP 
AJP  «  AKP  -  AJP 
BKP  ■=  BCKIO  +  BCK2) 

BI'.M  *  BCKIO  -  B(i:2) 

EJP  «  BtKl)  +  ECK3) 

BJM  -  BCKi)  -  BCK3) 

E(KK)  •  EKP  +  BJP 
EJP  •  BKP  -  BJP 
IF  (ISM.LT.0)  GO  TO  1B0 
AKP  -  AKM  -  BJM 
AKM  »  AKM  +  EJM 
EKP  -  BKM  +  AJM 
BKM  -  BKM  -  AJM 
IF  (S1.EQ.0.B)  GO  TO  ISO 
160  ACK1)  ■  AKP*C1  -  BKP*S1 

BCKI)  •  AKP*Sl  +  BKP*C1 

ACIC2)  -  AJP*  C2  -  BJP*S2 

BCK2)  -  AJP*S2  +  BJP*C2 

ACK3)  -  AKM*C3  -  BKM*S3 

BCK3)  -  AKM*S3  +  BKM*C3 

KK  «  K3  +  KSPAN 

IF  CKK.LE.NT)  GO  TO  150 
170  KK  -  KK  -  NT  +  JC 

IF  CKK.LE.MM)  GO  TO  130 
IF  CKK.LT. KSPAN)  GO  TO  200 
KK  -  KK  -  KSPAN  +  INC 
IF  CKK.LE.JC)  GO  TO  120 
IF  CKSPAN.EO. JC)  GO  TO  350 
GO  TO  40 

1B0  AKP  -  AKM  +  BJM 
AKM  •  AKM  -  BJM 
BKP  -  BKM  -  AJM 
BKM  ■  BKM  +  AJM  - 
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IF  (SI. HE. 0.0)  GO  TO  1S0 

190  A(K1)  ■  AKP 
B(K1)  -  BKP 
A(K2)  -  AJP 
A(K3)  -  AKM 
B(K2)  -  BJP 
B(K3)  -  BKM 
KK  -  K3  +  KSPAN 
IF  (KK.LE.NT)  GO  TO  153 
GO  TO  170 

203  SI  -  FLOAT ( (KK- 1 ) /JO  *DR*RAD 

ci  -  coscsn 
si  -  siNcsn 

MM  -  MIN0 (KSPAN. MM+KLIM) 

GO  TO  143 

TRANSFORM  FOR  FACTOR  OF  5  (OPTIONAL  CODE) 

210  C2  -  C72**2  -  S 72**2 
S2  -  2.0*C72*S?2 

223  Kl  -  KK  +  KSPftli 
K2  ■  Kl  +  KSPfti! 

K3  «  K2  +  KSPAN 
K4  -  K3  +  KSPAN 
AKP  -  A(KI)  +  ACK4) 

AKM  •  A(K1)  -  ACK4) 

Bl  P  -  B  C K 1 )  +  BCK4) 

AJP  -  ACIC2)  +  ft(K3) 

AJM  -  A(K2)  -  A(i(3) 

BJP  -  B(K2)  +  B(K3) 

BJM  -  B(K2)  -  B(K3) 

AA  -  A (KK) 

BB  -  B (KK) 

A(KK)  ■=  Aft  +  AKF  +  AJP 

BUCK)  -  BE  +  EKP  +  BJP 

AK  -  AI(P*C72  +  AJF*C2  +  AA 

BK  -  BKF*C72  +  BJP*C2  +  BB 

AJ  *  AI(M-:-S72  +  RJM*S2 

BJ  -  BKM*S72  +  EJM-IS2 

R(K1)  -  AK  -  BJ 

A  (1(4)  -  AK  +  BJ 

B(K1)  «  BK  +  AJ 

B(l(4)  -  Bl(  -  AJ 

AK  -  AKP*C2  +  AJP*C72  +  AA 

BK  ■  BKP*C2  +  BJF*C72  +  BB 

AJ  -  AKM*S2  -  AJM*S72 

BJ  «  BKM*S2  -  BJM*S72 

A(K2)  -  AK  -  BJ 

A(K3)  -  AK  +  BJ 

B(K2)  -  BK  +  AJ 

B(K3)  -  BK  -  AJ 

KK  «  K4  +  KSPAN 

IF  (KK.LT.NN)  GO  TO  220 

KK  -  KK  -  NN 
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IF  (KK.LE. KSPAN)  GO  TO  220 
GO  TO  2S0 
C 

C  TRANSFORM  FOR  ODD  FACTORS 
C 

230  K  -  NFAC(I) 

KSPNN  -  KSPAN 

KSPAII  =  KSPAN/K 

IF  CK.EQ.3)  GO  TO  103 
IF  (K.EQ.5)  GO  TO  210 
IF  (K.EQ.JF)  GO  TO  250 
JF  =  K 

SI  -  RAD/'(FLOAT(K)/'8.0) 

Cl  -  COS(Sl) 

SI  -  S  IN(Sl) 

CKCJF)  -  1.0 
SKCJF)  «  B.0 
J  »  1 

240  CKCJ)  -  CK(IO*Cl  +  SK(K)*Si 

SKCJ)  «  CKCK)*S1  -  SK(IO*Cl 

K  -  K  -  1 

CKCK)  =  CKCJ) 

SKCK)  *  -SKCJ) 

J  =  J  +  1 

IF  CJ.LT.K)  GO  TO  240 
253  l<  1  =  Klx 

KK  +  KSPNN 
ACKK) 

BCKIO 
AA 


K2 
AA 
ED 
AK 
BK 
J  ■ 


BB 


l 


2G0 


Kl  - 

Kl 

+ 

KSP 

AN 

K2  - 

K2 

- 

KSPAN 

J  -  J 

1  + 

1 

AT(  J) 

B 

ACK1) 

+ 

ACK2) 

AK  » 

ATCJ) 

+ 

ai: 

BTC  J) 

■ 

BCK!) 

+ 

B 

(K2) 

BK  - 

ET(J) 

+ 

BK 

J  -  J 

+ 

1 

AT  ( J) 

■ 

ft(Kl) 

- 

A 

CK2) 

BTCJ) 

B 

BCK1) 

- 

B 

CK2) 

Kl  - 

Kl 

+ 

KSPAN 

270 


IF  (K1.LT.K2)  GO  TO  2G0 
A(KK)  •  Al< 

BCKK)  -  BK 
K 1  -  KK 

K2  -  KK  +  KSPNN 
J  -  1 

Kl  ■  Kl  +  KSPAN 
K2  •  K2  -  KSPAN 
JJ  -  J 
AK  -  AA 

BK  -  BB 
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AJ  -  8.0 
BJ  -  0.0 
K  -  1 

230  K  «  K  +  1 

AK  -  AT(K)*CK(JJ)  +  AK 
BK  -  BTOOXCK(JJ)  +  BK 
K  -  K  +  1 

AJ  -  AT(K)*SK(JJ)  +  AJ 
BJ  -  BT (K) *SK ( JJ)  +  BJ 
JJ  -  JJ  +  J 

IF  (JJ.GT.JF)  JJ  -  JJ  -  JF 
IF  (K.LT.JF)  GO  TO  2B0 
K  -  JF  -  J 
ACK1)  -  AK  -  BJ 

BCK1)  -  BK  +  AJ 

A (K2)  -  AK  +  BJ 

B(K2)  «  BK  -  AJ 

J  -  J  +  1 

IF  CJ.LT.K)  GO  TO  270 
KK  «  KK  +  KSPi-iN 
IF  (KK.LE.NIi)  GO  TO  250 
KK  -  KK  -  NN 

IF  (KK.LE. KSPAN)  GO  TO  250 

MULTIPLY  EY  ROTATION  FACTOR  (EXCEPT  FOP  FACTORS  OF  2  AND  4) 

290  IF  (I.EC.M)  GO  TO  35E 
KK  -  JC  +  1 
300  C2  *  1.0  -  CD 
SI  -  SD 

MM  -  MING(KSPAN,KLIM) 

GO  TO  32e 

310  C2  «  Cl  -  (CD*C1+SD*S1) 

SI  -  SI  +  (SD*C1-CD*SI) 

THE  FOLLOWING  THREE  STATEMENTS  COMPENSATE  FOR  TRUNCATION 
ERROR.  IF  ROUNDED  ARITHMETIC  IS  USED,  THEY  MAY 
BE  DELETED. 

Cl  -  0.5/'(C2**2+Sl**2)  +  0.5 

51  -  C 1*S 1 
C2  -  Cl*C2 

320  Cl  -  C2 

52  -  SI 

KK  -  KK  +  KSPAN 
330  AK  -  A(KK) 

A(KK)  -  C2*AK  -  S2*B (KK) 

B(KK)  -  S2*AK  +  C2*B(KK) 

KK  -  KK  +  KSPNN 

IF  (KK.LE.NT)  GO  TO  330 

AK  -  S1*S2 

S2  -  S1*C2  +  C1*S2 

C2  -  C1*C2  -  AK 

KK  -  KK  -  NT  +  KSPAN 
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IF  (KK.LE.KSPNN)  ED  TO  330 
KK  -  KK  -  KSPliN  +  JE 
IF  (KK.LF.Itl)  GO  TO  310 
IF  (KK.LT.KEPAi!)  GO  TO  340 
KK  -  KK  -  KSPAU  +  JC  +  I  HE 
IF  (KK.LE. JC+JE)  GO  TO  303 
GO  TO  40 

340  SI  ■  FLOfiTC  (KK- 1  > /JE ) »:'DDx HF;D 
E2  -  COS (SI) 

SI  -  SINCSl) 

MM  -  MIN0 (KSPAN,  KM+KL IM) 

GO  TO  320 

PERMUTE  THE  RESULTS  TO  NORMAL  ORDER - DON”  IN  TVO  STAGES 

PERMUTATION  FOE  SOUAF.E  FAOTGf.'E  OF  II 

350  NP(  1)  -  KS 

IF  (KT.EQ.0)  GO  TO  442 
K  «  KT  +  KT  +  1 
IF  (M.LT.Iw  K  «  K  -  1 
J  -  1 

UPCI'.+l)  -  JC 

360  liPvJ+i)  -  l!F(J)/NFAC(J) 

KPti:)  *  i;pu-:+i)h::;ff.C(Jj 
j  -  j  +  i 

K  -  K  -  l 

IF  (J.LT.K)  GO  TO  360 
K3  «  NP(K+l) 

KSPAN«NP (2) 

KK  -  JC  +  1 
K2  -  KSPAN  +  1 
J  -  l 

IF  (N.NE.HTOT)  GO  TO  400 

PERMUTATION  FOP  SINGLE-VARIATE  TRANSFORM  (OPTIONAL  CODE) 

370  AK  «  A(KK) 

A (KK)  «  A(K2) 

A (K2)  -  AK 
El'.  -  8 (KK) 

B(KK)  -  B(K2) 

B(K2)  -  BK 
KK  -  KK  +  INC 
K2  -  KSPAN  +  K2 
IF  (K2.LT.KS)  GO  TO  370 
380  K2  -  K2  -  NP(J) 

J  -  J  +  1 
K2  •  NP(J+1>  +  K2 
IF  (K2.GT.NP(J) )  GO  TO  380 
J  -  1 

3S0  IF  (KK.LT.K2)  GO  TO  370 
KK  -  KK  +  INC 
K2  -  KSPAN  +  K2 
IF  (K2.LT.KS)  GO  TO  390 
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IF  (KK.LT.KS)  GO  TO  380 
JC  -  K3 
GO  TO  440 


PERMUTATION  FOP  MULTIVARIATE  TRANSFORM 

100  K  «  KK  +  JC 
110  AK  -  A  (UK) 

ACKIO  ■  A(K2) 

ACK2)  -  AK 
BK  -  BCKK) 

BCKK)  -  B(K2) 

B(K2)  -  BK 
KK  -  KK  +  INC 
K2  -  K2  +  INC 
IF  (KK.LT.K)  GO  TO  410 
KK  »  KK  +  KS  -  JC 
K 2  -  K2  +  KS  -  JC 
IF  CKK.LT.NT)  GO  TO  400 
K2  «  K2  -  NT  +  KSPfiN 


KK 

-  KK  -  NT 

+  JC 

IF 

(K2.LT.KS) 

GO 

TO 

400 

K2 

-  K2  -  NP ( J) 

J  * 

'  J  +  1 

K2 

-  NPCJ+l) 

+  K2 

IF 

(K2.GT.NPCJ)> 

GO 

TC  420 

J  ■ 

■  1 

IF 

(KK.LT.K2) 

GO 

TO 

400 

KK 

-  KK  +  JC 

K2 

*  KSPAN  + 

K2 

IF 

(K2.LT.KS) 

GO 

TO 

430 

IF 

(KK.LT.KS) 

GO 

TO 

420 

JC 

-  K3 

IF 

(2>:KT+1.GE 

.  M) 

RETURN 

KCPNM  «  HP CKT+l) 

PERMUTATION  FOR  SOUARE-FPEE  FACTORS  OF  N 

J  -  M  -  KT 
NFACCJ+l)  «  1 

45 C  NFAC(J)  -  NFACCJ)*NFAC(J+1) 

J  •  J  -  1 

IF  (J.NE.KT)  GO  TO  450 
KT  •  KT  +  1 
NN  -  NFAC(KT)  -  1 
JJ  -  0 
J  -  0 
GO  TC  480 
460  JJ  -  JJ  -  K2 
'">  -  KK 
•  K  ♦  1 
KK  -  NFAC(K) 

470  JJ  -  KK  +  JJ 

IF  (JJ.GE.K2)  GO  TO  460 
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n  r>  n  r>  n  r> 


NPCJ)  -  JJ 
480  K2  -  NFAC(KT) 

K  «  KT  +  1 
KK  -  NFfiC  (IQ 
J  -  J  +  1 

IF  CJ.LE.KM>  CO  TO  470 

DETERMINE  THE  PERMUTATION  CYCLES  OF  LENGTH  GREATER  THAN  1 

J  -  0 


GO 

TO  506 

K  - 

KK 

KK 

*=  NP  CK) 

NP  CIO  -  -KK 

IF 

(KK.NE. J) 

GO 

TO 

496 

K3 

-  KK 

J  ■ 

J  +  1 

KK 

-  NPCJ) 

IF 

(KK.LT.0) 

GO 

TO 

500 

IF 

(KK.NE. J) 

GO 

TO 

4S3 

NPC 

J)  -  -J 

IF 

CJ.NE.NI!) 

GO 

TO 

530 

MAXF  «  INC*MAXF 

REORDER  A  AND  G.  FOLLOWING  THE  PERMUTATION  CYCLES 

GO  TO  576 
510  J  -  J  -  1 

IF  <NP(J>.LT.0)  GO  TO  5 18 
JJ  -  JC 

520  KSPAH  »  JJ 

IF  CJJ.GT.KAX=)  KSPAN  -  MAXF 
JJ  -  JJ  -  KSPAH 
K  -  NP ( J) 

kk  -  jc*;:  +  i  +  jj 

K 1  -  KK  +  KSPAH 
K2  *  0 

530  K2  -  K2  +  1 

AT (K2)  -  A  (K 1 ) 

BTCi:2)  «  BCK1) 

K 1  *  K l  -  INC 
IF  (Kl.NE.KK)  GO  TO  530 
540  K1  -  KK  +  KSPAN 

K2  -  K1  -  JC’i;(K+NP CK) ) 

K  •  -NP (K) 

550  A ( l<  1 )  •  ACK2) 

BCIC 1)  -  BCK2) 

K 1  -  K1  -  INC 
K2  ■  K2  -  INC 
IF  (Kl.NE.KK)  GO  TO  550 
KK  •  K2 

IF  (K.NE.J)  GO  TO  540 
Kt  -  KK  +  KSPAN 

K2  -  0 
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IC2  ■  K2  +  1 
A(Kl)  -  AT (K2) 

B(K!)  -  BTCK2) 

Kt  •  Kl  -  INC 

IF  (Kl.NE.KK)  GO  TO  560 

IF  (JJ.NE.0)  GO  TU  520 

IF  (J.NE.l)  GC  T?  GIG 

J  -  K3  +  1 

NT  -  NT  -  KSPNN 

I  •  NT  -  INC  +  1 

IF  (HT.GE.0J  GC  TC  510 

RETURN 

END 


APPENDIX  C 


Winograd  Fourier  Transform  Algorithm  Program 
This  appendix  contains  the  driver  and  algorithm 
subroutine  for  executing  and  timing  the  Winograd  Fourier 
Transform  Algorithm  (WFTA).  The  program  was  developed  by 
McClellan  and  Nawab  (McClellan  and  Nawab,  1979).  The 

program  consists  of  six  subroutines  called  in  the  following 
order. 

INISHL  -INISHL  is  called  to  initialize  the  algorithm.  It 
decomposes  N  into  four  relatively  prime  factors, 
computes  the  multiplication  coefficients,  and 
generates  the  indices  for  the  input  and  output 
permutations.  INISHL  needs  to  be  performed  only 
when  a  new  sequence  length  is  to  be  transformed. 
PERM1  -PERM1  permutes  the  input  sequence  in  arrays  XR 
and  XI  and  stores  the  reordered  sequence  in  arrays 
SR  and  SI. 

WEAVE1  -WEAVE1  performs  the  input  additions  for  the  short 
transforms.  WEAVE1  contains  a  module  for  each 
factor's  short  transform. 

MULT  -MULT  performs  the  nested  multiplications. 

WEAVE2  -WEAVE2  performs  the  output  additions. 

PERM2  -PERM2  permutes  the  coefficients  in  arrays  SR  and 

SI  and  stores  them  in  arrays  XR  and  XI. 

The  variables  passed  are  defined  as  follows. 

N  -N  is  an  integer  variable  specifying  the  length  of  the 
input  sequence  x(n)  and  the  output  sequencr  X(k). 
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XR  -XR  is  a  real  array  storing  the  real  components  of  the 
original  sequence  and  final  output  sequence.  It  must 
have  length  of  at  least  N. 

XI  -XI  is  a  real  array  of  dimension  N.  It  stores  the 
imaginary  components  of  the  input  and  output  sequence. 

INIT  -INIT  is  an  integer  value  which,  if  set  to  zero, 
indicates  that  INISHL  should  be  called  to  perform  the 
initialization . 

IERR  -IERR  is  an  integer  value  returned  by  the  subroutine  to 
indicate  an  error  condition.  It  can  have  the  following 
values : 

IERR-0:  no  errors 

IERR— 1:  N  cannot  be  factored  properly 

IERR— 2:  not  initialized  for  this  value  of  N. 

SR  -SR  is  a  real  array  of  length  (ND1)(ND2)(ND3)(ND4) .  It 
stores  the  real  component  of  the  intermediate  results. 

SI  -SI  is  a  real  array  of  length  (ND1) (ND2) (ND3)(ND4) .  It 
stores  the  imaginary  component  of  the  intermediate 
results . 


DIMENSION  INDXl (504) . INDX2 (504) 

D IMENS ION  SR (792) . S I  (792) . COEF (792) 

DIMENSION  XR (564) . XI (5G4) 

DIMENSION  XAVG(4) 

READ (5.*)  NUMTIM 
DO  999  INUM=1> NUMTIM 
READ (5**0  T.DELT.M.LIST 
URITE(6»60l)  T.DELT.M 

601  FORMAT ( 1H l . ' TEST  UITH  T  -'.F10.2.'  DELT  «  ' .F 10.2.  'MULTIPLES' . 15) 

LCOUNT  -  0 
K  -  0 

111  CONTINUE 

LCOUNT  -  LCOUNT  +  1 
INVRS  •  0 
IN  IT  -  0 
101  CONTINUE 
K  -  K  +  1 

IF(K.GT. 1)  IN  IT  -  1 

N  -  (T/DELT)  +0.4 

DELF  -  1.0/T 

PI  -  3. 141592653096 

E  -  2.71028 

FO  -  25.0 

DO  10  I-l.N 

T1  «  ( l-l)*DELT 
ARG  »  2.0*PI*FO*T1 
TE  -  -1.0*T1 
XI ( I)  -  G.0 

10  XR ( I )  ■  (E**TE)*COS(ARG) 

CALL  UFTA(N.XF:.XI.  INIT,  IERR.SR.SI.COEF.M.  INDXl. INDX2. XOUT) 

IF ( IERR. NE . 0)  GO  TO  900 
PR INT*. ' XCUT' . XOUT 
XAVG (LCOUNT)  -  XOUT 
DO  30  I-i.KPTS 

30  XR( I)  -  <(XR(l)**2)  +  (XICI)**2))  **0.5 

PRINT*.  N 

IF (LCOUNT.LT, 3)  GO  TO  III 
TAVG  -  0. 

DO  599  J-1.3 

TAVG  -  TAVG  +  XAVG(J) 

599  CONTINUE 
TAVG  -  TAVG/3 

PRINT*.  'AVG  XOUT  ■  '.TAVG 
IF(K.LE.l)  GO  TO  101 
IFCLIST.NE. 1)  GO  TO  999 
URITE(6.600) 

600  FORMATC  1' .20X.  'MAG' .  16X. '  IMAG' .  16X.  16X./) 

URITEC6.700)  (I.XR(I).Xl(l).  I-l.N) 

780  FORMAT (5X. I5.5X.E14.7.5X.E14.7.5X) 

999  CONTINUE 
60  TO  990 

900  PRINT*.' IERR  •  '.IERR 
990  STOP 
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SUBROUTINE  UFTACN.XR.XI, INIT, IERR.SR.SI.COEF.M. INDX1. INDX2.X0UT) 
REAL  XRUJ.Xim 
C 

C  THE  FOLLOWING  TUG  CARDS  MAV  BE  CHANGED  IF  THE  MAXIttJM 

C  DESIRED  TRANSFORM  LENGTH  IS  LESS  THAN  584G 

REAL  SR(l),SI(l).COEF(l) 

INTEGER  INDX1 ( l) . INDX2( D 

c 

COMMON  NA.N8.NC,ND.NDl,ND2,ND3,ND4 
C 

C  TEST  FOR  INITIAL  RUN 

XIN  -  SECNDSC0.B) 

IFUNIT.EQ.0)  CALL  INISHL(N,COEF,XR,XI.  INDX1.  INDX2.  IERR) 

C 

IF(IERR.LT.B)  RETURN 
M»NA*NB*NC*ND 
IF(M.EQ.N)  GO  TO  100 
IERR-- 2 
RETURN 

PROGRAM  NOT  INITIALIZED  FOR  THIS  VALUE  OF  N 

00  NMULT«=ND1*ND2*ND3*ND4 

PERMUTE  THE  INPUT  DATA 

CALL  PERM1 (SR.SI.XR.XI, INDX1) 

DO  THE  PRE-WEAVE  MODULES 

CALL  UEAVEKSR.SI) 

DO  THE  NESTED  MULTIPLICATIONS 

CALL  MULT(SR.SI.COEF.NMULT) 

DO  THE  POST-UEAVE  MODULES 

CALL  UEAVE2(SR.SI) 

PERMUTE  THE  OUTPUT  DATA 

CALL  PERM2CSR.SI.XR.XI. INDX2) 

XOUT  -  SECNDSCXIN) 

RETURN 

END 

SUBROUTINE  INISHUN.COEF.XR.XI. INDX1, INDX2. IERR) 

REAL  COEF(l).XP.(l).XI(n 

INTEGER  Sl.S2.53.54. INDX1 ( I). INDX2C  P.Pl 

REAL  C03 (3) . C04C4) . COB (8) . C09 ( 11) . CO  IS ( IB) . CD l ( 18) . CD2 ( 1 1) . 

ICD3<9) *CD4(S) 
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noon  non 


COMMON  NA,NB,NC,ND,ND1,ND2,ND3,ND4 

DATA  STATEMENTS  ASSIGN  SHORT  DFT  COEFFICIENTS. 

DATA  CG3/1 .  0.  - 1 . 5.  -0 .  8660254038/ 

DATA  004/1.0,1.0,1.0,1.0/ 

DATA  CO8/1.0, 1.0. 1.0. ~1.0, 1. 0.-0. 7071067812. 

1  -1.0,0.7071067812/ 

DATA  009/1  . 0, - I . 5. -0 .  8660254033, 0.5.0. 7660444431 . 

1  -0. 1736481777.0.9396926203.-0.6427076097, 

2  -0 . 934807753 . 0 . 342020 1 433 , -0 . 6660254033/' 

DATA  CO16/1.0. 1.0. 1.0. -1.0. 1.0.-0.7071067812.-1.0. 

1  0.7071067812. 1.0.0. 5411S61B01. -0.7071067812, 

2  -0.5411961001.-1.0,-1.306562965,0.7071067812. 

3  1 . 306562965,-0 . 9238795325.0 . 3E26B34324/ 

DATA  CD1/1B*1./ 

DATA  CD2/ll*l./ 

DATA  CD3/1 .0. - 1 . 1666666667, -0 . 4409585518.0 . 7343022012. 

1  0.790 1564685 , -0 . 3408729306,-0 . 87484229 l . 

2  0.0558542673,0.5339693603/ 

DATA  CD4/1 . 0, - 1 . 25, - 1 . 538841769,0 . 55S0 169944. 0 . 363271264. 
1  0.5877852523/ 

FOLLGUING  SEGMENT  DETERMINES  FACT0R5  OF  N  AND  CHOOSES 
THE  APPROPRIATE  SHORT  DFT  COEFFICIENTS. 

IERR-0 
NDl-i 
NA-1 
NB-1 
ND2-1 
NC-1 
ND3-1 
ND«1 
ND4*  1 

IF (N.LE.01  GO  TG  190 
IF(16*(N/16) .EO.N)  GO  TO  30 
IF (8*(N/8) .EO.N)  GO  TO  40 
IF(4*(N/4).EQ.N)  GO  TO  50 
IF (2*(N/2) .NE.N)  GO  TO  70 
ND1-2 
NA*2 

CD1 (2)*1 .0 
GO  TO  70 

30  ND1-18 
NA-16 

DO  31  J-1,18 

31  CDI(J)-C016(J) 

GO  TO  70 

40  ND1-8 
NA-8 

DO  41  J-1.8 

41  CDl(J)-COBCJ) 

GO  TO  70 
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DO  51  J-t.4 
CD  1 ( J) -C04C J) 

IF (3c<(N/3)  .NE.N)  GO  TO  120 
IF(9n:(N/S>  .EO.H)  GO  TO  100 
ND2-3 
NB-3 

DO  71  >1.3 
CD2(J)«C03(J) 

GO  TO  120 

ND2-11 

NB-9 

DO  110  >1. 11 
CD2(J)«C09(J) 

IF (?*CN/7) .NE.N1  GO  TO  1S0 

ND3-9 

NC-? 

IF(5*(N/5) .NE.N)  GO  TO  190 

ND4-6 

HD-5 

IF (M.EQ.N)  GO  TO  250 
PRINT*. 'THIS  N  DOES  NOT  WORK' 
IERR— 1 
RETURN 


NEXT  SEGMENT  GENERATES  THE  DFT  COEFFICIENTS  AND 
THE  FLAG  ARRAY. 

>1 

DO  300  N4-1.ND4 
DC  30G  N3-1.ND3 
DO  300  N2-1.HD2 
DO  30E  Nl-l.NDl 

COEF  C J)  -CI>  1  <H  l )  *CD2(N2)*CD3  CN3) *CD4(N4) 

J-J+l 

CONTINUE 

FOLLOWING  SEGMENT  FOR  INPUT  INDEXING. 

51- 0 

52- 0 

53- 0 

54- 0 

55- 0 
J-l 

NU«NB*NC*ND 

NV-NA*NC*ND 

NW«NA*NB*ND 

NY-NA*NB*NC 

K-l 

DO  440  N4-1.ND 
DO  430  N3-1.NC 
DO  420  N2-1.NB 
DO  410  Nl-l.NA 
IF(K.LE.N)  GO  TO  40B 


K-K-N 
GO  TO  405 
400  IN&Xl(J)«K 
J-J+l 

410  K-K+NU 
420  K-K+NV 
430  K-K+NU 
440  K-K+NY 
C 

C  FOLLOUING  SEGMENT  FOR  OUTPUT  INDEXING 

M-l 

IF(NA.EQ.l)  GO  TO  530 
520  P1«M*NU-1 

IF((P1/NA)*NA.EQ.P1)  GO  TO  510 
M-tt+1 
GO  TO  520 
510  Sl-Pl+1 

530  IF(NB.EQ.l)  GO  TO  540 
M-l 

550  P 1 «M*NV- 1 

IF( (Pl/NBltfNB.EC.Pl)  GO  TO  5GG 
M-M+l 
GO  TO  550 
560  S2-P1+1 

540  IFCNC.EQ.n  GO  TO  630 
11=  1 

620  Pl«M*NU-l 

IFUPl/NC)*NC.EO.Pn  GO  TO  610 
M-rt+1 
GO  TO  620 
610  S3-P1+1 

630  IFlND.EQ. 1)  GO  TO  660 
M=  1 

640  P 1 «M*NY- 1 

IFtlPl/'ND)*ND.EO.Pl)  GO  TO  650 
M-M+l 
GO  TO  640 
650  S4-P1+1 

660  J-l 

DO  810  N4-I.ND 
DO  810  N3-1.NC 
DC  810  N2-1.NB 
DO  810  Nl-l.NA 

INDX2  ( J) -S l*(N 1- U +S2*(N2- 1 ) +53*CN3- 1) +S4*(N4- 1) +1 
900  IF(INDX2(J) .LE.N)  GO  TO  910 
INDX2CJ) -INDX2(J)-N 
GO  TO  900 
910  J-J+l 
810  CONTINUE 
RETURN 
END 

SUBROUTINE  PERM1 CSR.SI.XR.XI. INDX1) 

COMMON  Nfi.NB.NC.ND.NDl.ND2.ND3,ND4 
REAL  XRCl),X[m,SRU),SI(l) 


INTEGER  INDXl(l) 

J-t 

K-l 


INC1-ND1-NA 

INC2-ND1*CND2-N3) 

INC3-ND1*ND2*(ND3-NC) 

DO  40  N4-I.ND 
DO  30  N3-1.NC 
DO  20  N2-1.NB 
DO  10  Nl-l.NA 
£R( J) -XR( INDX1 (K) 5 
SI(J)*XI( INDX1 (K) ) 

K-K+l 
10  J-J+l 

21  J-J+INC1 

33  J-J+INC2 

43  J-J+INC3 

RETURN 
END 

SUBROUTINE  PERM2CSR.SI.XR.XI. INDX2) 
CGMMOH  Nft-.NB.NC.ND/ ND 1 . ND2.ND3. ND4 
REAL  XR(l)»XI(l).SR(l).SI(l) 
INTEGER  INDX2C 1) 

J-l 

K-l 

INCl-NDl-Nfi 

INC2-ND1*(ND2-NB) 

INC3-ND1*ND2*(ND3-NC) 

DO  40  N4-1.ND 
DO  30  N3-1.NC 
DO  20  N2-1.NB 
DO  10  Hl-l.NA 
XR ( INDX2 (K) ) -SR  C J) 

XI ( INDX2CK) ) -SI  (J) 

K-K+l 
10  J-J+l 

20  J-J+INC 1 

30  J-J+INC2 

40  J-J+INC3 

RETURN 
END 

SUBROUTINE  WEAVE l (SR.  SI) 

COMMON  Nft. NB. NC. ND, ND 1 . ND2.ND3, ND4 
REAL  Q(B).T(16),SR(l).SI(l) 

IF (Nfl.EQ.  1)  GO  TO  300 
IFCNA.NE.2)  GO  TO  B00 


C 

C 

C 

C 

C 

C 


THE  FOLLOWING  CODE  IMPLEMENTS  THE  2  POINT  PRE-LEAVE  MODULE 


HLUP2-2*(ND2-NB) 


NLUP23-2*ND2*(ND3-NC) 

NBASE -1 

DO  240  N4-1,ND 
DO  230  N3-1.NC 
DO  220  N2-1.N8 
NR1-NBASE+1 
T0-SR (NBASE ) +SR (NR  1 ) 

SR (NR  1) -SR (NBASE) -SR (NR  I ) 

SR(NEASE) -T0 
T0»SI(NBASE)+SI(NR1) 

S I (NR  1 ) -SI (NEASE) -S [ (NR  1) 

SI  (N8ASE) -T0 
220  N9ASE-NBASE+2 
230  NBASE-NBASE+NLUP2 
246  NBASE -NBASE+NLUP23 

800  IF(NA.NE.B)  GO  TO  1600 
C 

c 

C  THE  FCwLOUING  CODE  IMPLEMENTS  THE  B  POINT  PRE-UEAVE  MODULE 
C 

C 

NLUP2«8*(ND2-NB) 

NLUP23«S*ND2*(ND3-NC) 

NBASE-1 

DO  840  N4-1.ND 

DO  830  N3-I.NC 

DO  820  N2-I.NB 

NRl-NBASE+l 

NR2-NR1+1 

NR3-NR2+1 

NR4-NR3+1 

NR5-NR4+1 

NR6-NR5+1 

NR?-NR6+1 

T3«SR(NR3)+SR(NR?) 

T? -SR (NR3) -SR (NR?) 

T0-SR (NBASE) +SR (NR4) 

SR (NR4) -SR (NEASE ) -SR (NR4) 

Tl-SR(NR1)+SR(NR5) 

T5-SR (NR  1 ) -SR (HR5) 

T2-SR (NR2) +SR (NR6) 

SR (NR6) -SR (NR2) -SR(NRS) 

SR (NBASE) -T0+T2 
SR(NR2) -T0-T2 
SR(NRl) -T1+T3 
SR(NR3)-T1-T3 
SR(NR5) -T5+T7 
SR (NR?) -T5-T7 
T3-SI (NR3)+SI (NR7) 

T7-SI (NR3)-SI (NR?) 

T0-S I (NBASE) +S I (NR4) 

S I (NR4) -S I (NBASE ) -S I (NR4) 


f 


T1-SI(NR1)+S!(NR5) 

T5-SI(NR1)-SI(NR5) 

T2*SI (NR2)+SI (NR6) 

SI  (NR6) -SI (NR2J -S I (HRS) 

SI (NEftSE) -T0+T2 
S I (NR2) *T0-T2 
SI(NR1)«T1+T3 
S I (NR3) -T1-T3 
SI (NR5) -T5+T7 
SI (NR?) -T5-T7 
820  NBASE-NBASE+8 
830  NBASE-NBASE+NLUP2 
840  NBASE-NBASE+NLUP23 
1600  IFCNA.NE. 16)  GO  TO  300 
C 

C 

C  THE  FOLLOWING  CODE  IMPLEMENTS  THE  16  POINT  PRE-UEAVE  MODULE 
C 

C 

NLUP2«1E*(ND2-NB) 

NLUP23 - 1 8>,'l  ID2>!-  (ND3-NC ) 

HBASE-1 

DO  1640  N4-1.ND 

DO  1630  N3«1,NC 

DO  1620  N2-I.NB 

NRl-NEASE+l 

NR2-NE1+1 

NR3-NR2+1 

NR4-I1R3+1 

NR5-NR4+1 

NR6-NE5+1 

NR7-NR6+1 

NRB-NR7+1 

NR9-NR8+1 

NR10-NR9+1 

NR11-NR10+1 

NR12-NR11+1 

NR13-NR12+1 

NR14-NR13+1 

NR15-NR14+1 

NR16-NR15+1 

NR17-NR16+1 

JBASE-NBASE 

DO  1645  J-1.8 

T(  J) -SR( JBASE)+5R( JBASE+8) 

TC J+8) -SR ( JBASE) -SR ( JBASE+8) 

JBASE-JBASE+1 
1645  CONTINUE 

DO  1650  J-l.4 
Q(J) *T( J)+T( J+4) 

QCJ+4) -T(J)-T(J+4) 

1650  CONTINUE 
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SR*(NBASE)  -Q(  1)+QC3) 
SR(NR2)-Q(1)-Q(3) 

SR(NRl) «Q(2)+0(4) 
SR(NR3)-Q(2)-Q(4) 
SR(NR5)-0(6)+0(B) 
SR(NF.7)-Q(6)-Q(8) 

SR(NR4) *Q(5) 

SR(NR6)-Q(7) 

SR(NR8)-T(9) 

SR(NR9)-T(10)+T(16) 

SR (NR  15) -T( 10)-T( 16) 

SR(NR13) -T( 14)+T( 12) 
SR(NR11)-T(14)-T(12) 

SR (NR  17) “SR (NR  1 1 ) +SR (NR  15) 

SR (NR  16) -SR(NR9)+SR(NR13) 

SR (NR  IB) -T (11) +T (15) 

SR (NR  14) -T( 1 1) -T( 15) 

SR (NR  12) *T( 13) 

JBOSE-NEftSE 

DO  1745  J- 1.8 

T ( J)  -S  I  ( JEftSE) +S I  ( JEASE+-8) 

T ( J  +8 )  -SI  (JBP.SE)-SI  (JBfiSE+8) 
JBfiSE-JBflSE+1 
1745  CONTINUE 

DO  175e  J-l.4 

0(J)-T(J)+T(J+4) 

0(J+4)-T(J)-T(J+4) 

1750  CONTINUE 

S I (NBPSE) »Q( l)+Q(3) 
SI(NR2)-0(1)-Q(3) 

S I  (NR  1 )  -0  (2)  +0  (4) 

SI (NR3) *Q(2)-Q(4) 

SI (Nr5)-Q(6)+0(8) 

S I (NR7) -Q (6) -Q (6) 

S I (NR4) -Q (5) 

S I (NR6) *0(7) 

SI (NRB) -T(9) 

S I (NR9) «T(10)+T(16) 

SI (NR15) -T( 10)-T( 16) 

SI (NR13) -T( 14)+T( 12) 

SI (NR1 1)-T( 14)-T( 12) 

SI (NR  17) -SI (NR  1 1)+S I (NR 15) 
SI (NR  16) -SI (NR9) +SI (NR  13) 
SI(NR10)-T(11)+T(15) 

SI (NR  14) -T( 1 1)-T( 15) 
SI(NR12)-T(13) 

1620  NBASE-NBASE+1B 
1630  NBRSE-NBASE+NLUP2 
1640  NBASE-NBASE+NLUP23 
300  IF(HB.EO.t)  GO  TO  700 
IF(NB.NE.3)  GO  TO  900 
C 


C  THE  FOLLOWING  COIJE  IMPLEMENTS  THE  3  POINT  PRE-WEAVE  MODULE 
C 

c 

NLUP2«2*ND1 

NLUP23-3*ND1*(ND3-NC) 

N3ASE-1 
NQFF-ND1 
DO  340  N4-I.ND 
DO  330  N3-1.NC 
DO  310  N2-1.ND1 
NR1-NBASE+NOFF 
NR2-NR1+H0FF 
Tl«SR(NRl)+SR(NR2) 

SR(NBASE) -SR(NBASE)+T1 
SR (NR2) -SR (NR  1 ) -SR  CNR2) 

SR(NR1)-T1 

Tl-SI(NR1)+SICNR2) 

S I  (MEASE) “SI (NBASE)+T1 
S I (NR2) -SI (NR1) -SI (NR2) 

SI(NR1)-T1 
310  NBASE-NBASE+1 
330  M2 ASE *>NE  AEE+NLUP2 
340  NEASE-NBASE+NLUP23 
900  IFCNB.NE.9)  GO  TO  700 
C 

c 

C  THE  FOLLOWING  CODE  IMPLEMENTS  THE  9  POINT  PRE-WEAVE  MODULE 
C 

c 

NLUP2* 10*ND 1 

NLUP23- 1 l*NDl*(ND3-NC) 

NBASE-1 

NOFF-ND1 

DO  943  N4-I.ND 

DO  930  N3-l.N0 

DO  910  N2-1.NDI 

NR1-NCASE+N0FF 

NR2-NR1+N0FF 

NR3-NR2+N0FF 

NR4-HR3+N0FF 

NR5-NR4+N0FF 

NR6-NR5+N0FF 

NR7-NR6+N0FF 

NR8-NR7+N0FF 

NR9-NR8+H0FF 

NR10-NR9+NOFF 

T3-SR(NR3)+SRCNR6) 

T6-SR (NR3) -SR (NR6) 

SR(NBASE) -SR (NBASE) +T3 
T7-SR (NR?) +SR (NR2) 

T2-SR(NR?)-SR(NR2) 


C-12 


nnnnnn 


SR(NR2) -TG 

T1 *SR (NR  1 ) +SR (NR8) 

T8«SR(NR1)-SR(NR8) 

SR(NRl) -T3 
T4«SR(NR4)+SR(NR5) 

T5-SR (NR4) -SR (NR5) 

SR (NR3) -TI+T4+T? 

SR(NR4)«T1-T7 
SR (NR5) -T4-T1 
SR (NR6) -T7-T4 
SR (NR  10) -T2+T5+TB 
SR (NR?) -T8-T2 
SR(NRS)-T5-T8 
SR(NR9) -T2-T5 
T3-SI (NR3)+SI (NR6) 

T6«SI(NR3)-SI(NR6) 

S I (NBASE) -S I (NBASE) +T3 
T7-SI (NE?)+SI(NR2) 

T2-S I (NR?) -S !  (NR2) 

S I (NR2) -TG 
T 1  “S I  (NRD+SI  (NRB) 

TE-SI (NRl)-SI  (NR8) 

SKNRD-T3 
T4-SKNR4)  +SKNR5) 

T5-SI(NR4)-SI(NR5) 

SI (NRS) -T1+T4+T7 
SI(NR4) -T1-T7 
S I (NR5) -T4-T1 
S I (NRS) -T7-T4 
SI  (NRIG) -T2+T5+T8 
SI(NR?)-T8-T2 
S I  (MRS) -T5-T8 
S I  (NRS) *12-15 
910  NBASE-NBASE+1 
930  NBASE-NBASE+NLUP2 
940  NBASE -NEASE+NLUP23 
700  IF (NC . NE . 7)  GO  TO  500 

THE  FOLLOUING  CODE  IMPLEMENTS  THE  ?  POINT  PRE-UEAVE  MODULE 

#OWOWoMc>Mafc*3^**^*^*>K*Ho|ok>KJtQtoMoK>Mofc^>M«K*:i<>K*^>Wo4olot<*>IOK 

N0FF-ND1*ND2 
NBASE- 1 
NLUP2-8*N0FF 
DO  740  N4-t,ND 
DO  710  Nl-l.NOFF 
NR1-NBASE+NOFF 
NR2-NR1+N0FF 
NR3-NR2+N0FF 
NR4-NR3+N0FF 
NR5-NR4+NOFF 
NR6-NR5+N0FF 


C-13 


n  ci  n  n  n 


NR7-NR6+N0FF 
'  NR8-NR7+N0FF 

Tl-SR (NR  1 ) +SR (NR6) 

.  t6-sr(npi)-sr(nr6) 

T4-SR (NR4) +SR (NFS) 

T3-SR (NR4) -SR (NR3) 

T2-SR(NR2)+SR(NR5) 

T5-SR (NR2) -SR (NR5) 

SR(MR5) -T6-T3 
SR  (NR2)  -T5+T3+T6 
SRCNR65-T5-T6 
SR(NR8)«T3-T5 
SR(NR3)«T2-Tt 
SR (NR4) -T1-T4 
SR (NR?) -T4-T2 
T1-T1+T4+T2 

SR (NBASE) -SR (NBftSE) +Tt 

SR (NR  1 )  -T1 

Tl-SI (NR1)+SI(NR£) 

TG-S I (NR  1) -S I (NR6) 

T4-S I (NR4) +S I (NR3) 

T3=S I (NR4) -S I (NR3) 

T2-S I (NR2) +S I (NR5) 

T5-SI (NR2) -S I (NR5) 

SI(NR5)-T6-T3 
S I (NR2) -T5+T3+T6 
SI(NR6)-T5-T6 
SI(NRe)-T3-T5 
SI(NR3)-T2-T1 
S I (NR4) -T1-T4 
SI(NR7)«T4-T2 
Tt -11+14+12 

SI (NEASE)-SI (NBASE)+T1 
SKNRD-Tl 
710  NEfiSE«N5fiSE+l 
740  NBfiSE-NCfiSE+NLUP2 
500  IF (ND . NE . 5)  RETURN 
C 

THE  FOLLOWING  CODE  IMPLEMENTS  THE  5  POINT  PRE-UEAVE  MODULE 


NGFF-ND1*ND2*ND3 
NBASE- l 

DO  510  Nl-l.NOFF 
NR1-NBASE+NOFF 
NR2-NR1+N0FF 
NR3-NR2+H0FF 
NR4-NR3+N0FF 
NR5-NR4+N0FF 
T4-SR (NR  1 ) -SR (NR4) 
T1  -SR (NR  1 ) +SR (NR4) 


C- 14 


o  o  n  n  o  o  o 


T3-SR(NP3)+SR(NR2) 

T2«SR(NR3)-SR(NR2) 

SR  (NR3) =Tt-T3 
SRCNRl) *T1+T3 

SRCNBASE)  *SF:CNBASE)+SR(NR1) 

SRCNR5)-T2+T4 

SR(NR2)«T4 

SRCNR4)-T2 

T4=S I (NR  1 ) -S I (NR4) 

T1«SI  (NRD+SI  (NR4) 

T3*SI (NR3)+SICNR2) 
T2«SI(NR3)-SI(NR2) 

SI(NR3)-T1-T3 

SKNRD-T1+T3 

S I (NBASE) -S I (NBASE) +S I  CNR 1 5 
SI (NR5) “T2+T4 
SICNR2)-T4 
SI(NR4)«T2 
510  NBASE-NBASE+1 
RETURN 
END 

SUBROUTINE  MULT  (SR.SI.COEF.NMULT) 
COMMON  NA. NB. NC. ND. ND 1 . ND2.ND3. ND4 
REAL  SR(l).SKl).COEFCl) 

DO  10  J-l.NKULT 

SP.(J)-SR(J)*COEFCJ) 

SICJ)'SI(J)*COEFU) 

10  CONTINUE 
RETURN 
END 

SUBROUTINE  UEAVE2CSR.SI) 

REAL  SRCD.SICt) 

COMMON  HA. NB. NC. HD. ND  1 . ND2.ND3, ND4 
REAL  0(8) .T( 1G) 

IFCND.NE.5)  GO  TO  700 


THE  FOLLOWING  CODE  IMPLEMENTS  THE  5  POINT  POST-WEAVE  MODULE 

!WOMoWotc>Mo(:>^^>M;;K>WoMoK)W^:>MO*3K)WctojoKX:>W<:K^>K^oK*>Mofc5Woto|<>tc 

N0FF«ND1*ND2*ND3 

NBASE-1 

DO  510  Nl-l.NOFF 

NR1-NBASE+H0FF 

NR2-HR1+N0FF 

NR3-NR2+N0FF 

NR4-NR3+N0FF 

NR5-NR4+N0FF 

T 1 »SR ( NBASE ) +SR ( NR  1 ) 

T3-T1-SRCNR3) 

.  T1-T1+SRCNR3) 

T4-SICNR2)+SI(NR5) 


C-15 


n  n  r>  n  n  a  n 


T2*S I (NR4) +S I (NR5) 

'  SR(NR1)-T1-T4 
SR4-T1+T4 
SR2-T3+T2 
SR(NR3) «T3-T2 
T1«SI  (HBPiSE)  +S I  (NR  U 
T3-T1-SKNR3) 

T1-T1+SKHR3) 

T4«SR (NR2) +SR (NR5) 

T2«SR (NR4) +SR (NR5) 

S I (NR  1 ) «T1+T4 
SI (HR4) ■T1-T4 
SI(NR2)«T3-T2 
SI(NR3)«T3+T2 
SR(NR2)-SR2 
SRCNR4) *SR4 
510  NDfiSE-HBftSE+1 
700  IFCNC.NE.7)  GO  TO  300 

)|OMOM3nK)K*>|gKiMo|o»yi)alaMoK*]K*>l<<toKiMO**]|oyyiMoMolo!^ 

THE  FOLLOUING  CODE  WLEf^NTS  THE  7  POINT  POST-LEftVE  MODULE 


i(o»oK*^o|o|oK)Wt3loK)t^i)oMoK^i(oMc)(oK*^^MoJoto^*HoW^)|o(o)oK**iMoloiolo(o|« 


N0FF-ND1*ND2 

NBASE*1 

NLUP2*8*N0FF 

DO  740  N4-1.ND 

DO  710  Nl=l.NOFF 

NRl-NBfiSE+NOFF 

NR2-NR1+H0FF 

NR3-NR2+H0FF 

NR4-MR3+N0FF 

NR5-NR4+NOFF 

NR6-NR5+N0FF 

NR7-NR6+N0FF 

NR8-NR7+N0FF 

Tl-SR(NRl)+SR(NBRSE) 

T2«T1-SR(NR3)-SRCNR4) 

T4«T1+SR (NR3) -SR (NR7) 

T1»T1+SR (NR4) +SR (NR7) 

T6-SI(NR2)+SI(NR5)+SI(NR8) 

T5“SI (NR2)-SI (NR5) -SI (NR6) 

T3*SI (NR2)+SI (NR6)-S I (NR8) 

SRCNRD-T1-T6 

SRS-T1+T6 

SR2-T2-T5 

SR5-T2+T5 

SR(NR4) "T4-T3 

SR(NR3)-T4+T3 

Tl "S I (NR l ) +S I (NBflSE) 

T2-T1-SI(NR3)-SI(NR4) 

T4-T1+S!(NR3)-SI(NR7) 


T1-T1+SICNR4)+SI(NR7) 

T6-SR (NR2) +SR (NR5) +SR (NR8) 

T5«SR(NR2)-SR(NR5)-SR(NR6) 

T3-SR (NR2) +SR (NR6) -SR (NR8) 

SI(NR1)«T1+T6 
SI(NR6)»T1-T6 
SI(NR2)«T2+T5 
SI(NR5)-T2-T5 
SI(NR4) -T4+T3 
SI (NR3) “T4-T3 
SR(NR2)*SR2 
SRCNR5) “SR5 
SRCNR6) «SR6 
710  UBASE-NEASE+1 
740  NEASE-NBASE+NLUP2 
300  IF(NB.EQ.l)  GO  TO  400 
IF(NB.NE.3)  GO  TO  900 
C 

c 

C  THE  FOLLOUING  CODE  IMPLEMENTS  THE  3  POINT  POST-LEAVE  MODULE 
C 

c  iMoloKiMoMt^iKJMot:i|Gta)oK)Wr.KH::WKX^T»^K)Mo|{i|g)o)ok*S-'i)oMo»"ranioK»WoloMo|o^ 

c 

NLUP2-2*ND1 

NLUP23«3*ND1*CND3-NC) 

NBASE-1 

NOFF-ND1 

DO  340  N5-I.ND 

DO  338  N4-1.NC 

DO  310  N2-1.ND1 

NR1-NBASE+NOFF 

NR2-NR1+N0FF 

T 1  "SR  ( NBASE )  +SR  ( NR  1 ) 

SR(NRl) “Tl-SI (NR2) 

SR2-T1+SICNR2) 

Tl*SIiNBftSE)+SI(NRl) 

SI(MRI)-T1+SR(NR2) 

SI CNR2) *T1-SR(NR2) 

SR(NR2)«SR2 
310  NBASE»NBASE+1 
330  NBASE*NBASE+NLUP2 
340  N0ASE*NBASE+NLUP23 
900  IF(NB.NE.9)  GO  TO  400 


C  THE  FOLLOUING  CODE  IMPLEMENTS  THE  9  POINT  POST-UEAVE  MODULE 
C 

c 

NLUP2- 10*ND 1 
NLUP23- I 1*ND t*(ND3-NC) 

NBASE- 1 


N0FF-ND1 

DO1  940  N4-I.ND  • 

DO  930  N3“1,NC 

DO  910  N2“1»ND1 

NR1-NBASE+NOFF 

NR2-NR1+N0FF 

NR3-NR2+N0FF 

NR4-NR3+N0FF 

NR5“NR4+NGFF 

NRE-NR5+N0FF 

NR7-NR6+N0FF 

NR8-NR7+N0FF 

NR9-NRB+N0FF 

NR10-NR9+NOFF 

T3«SR (NBASE) -SR (NR3) 

T7-SR (NBASE )+SR (NR  1) 

SR (NBASE) «SR (NBASE) +SR (NR3)+SR (NR3) 
T6-T3+SKNR10) 

SR (NR3) “T3-S I (NR  10) 

T4-T7+SR ( NP5 ) -SR (HR6 ) 
T1“T7-SR(NR4)-SR(NR5) 

T7“T7+SR (NR4) +SR (HRS) 

SR (NRG) “T6 

Te-S I (NR2) -S I (NR7)-S I (NR8) 

T5-S I (NR2) +S I (NRG) -S I (NR9) 

T2=S I (NR2) +S I (NR7)+S I (NR9) 

SR(NR1)-T7-T2 

SR8-T7+T2 

SR (NR4) “TI-T8 

SR (NR5) “T1+T8 

SR7-T4-T5 

SR2-T4+T5 

T3“SI (NEASE)-SI (NR3) 
T7-SI(NEASE)+SI(NR1) 

S I (NBASE) “S I (NBASE) +S I CNR3J+S I (NR3) 
T6-T3-SR(NR10) 

SI (NR3) “T3+SR (NR  10) 
T4-T7+SI(NR5)-SI(NRG) 

T1“T7-SI (NR4)-SI (NR5) 

T7“T7+SI (NR4)+SI (HRS) 

SI (NR6) “T6 

T8-SR (NR2) -SR (NR7) -SR (NR8) 

T5-SR (NR2) +SR (NRB) -SR (NR9) 

T2 -SR ( NR2 ) +SR ( NR  7 ) +SR  C  NR9 ) 
SKNRD-T7+T2 
SI(NR8)“T7-T2 
SI (NR4) “Tl+TB 
SI (NR5) “Tl-TB 
SI (NR7) “T4+T5 
SI (NR2) “T4-T5 
SR(NR2)“SR2 
SR(NR7)-SR7 
SR(NR8) “SR8 
918  NBASE-NBASE+1 


C-18 


930  NBP*SE-NBASE+NLUP2 
940  NBASE-NBASE+NLUP23 
400  IF(NA.EG.l)  RETURN 

IF(NA.NE.4)  GO  TO  B00 
C 

c  )to|()|otoig«o)oK3)oK>|o>3|oMo<o^;**3f3|oioioK3to^^ 

c 

C  THE  FOLLOWING  CODE  IMPLEMENTS  THE  4  POINT  POST-WERVE  MODULE 
C 

C  *o|oMafc#5|afc#aMot3Wo|oMoWaWotot*>|ctat*)Mo|otat^^^ 

c 

NLUP2»4*(ND2-NB) 

NLUP23«4*ND2*(ND3-NC) 

NBASE-t 

DO  440  N4-UND 
DO  430  N3«1,NC 
DO  420  N2«1,NB 
NR1-NBASE+1 
NR2-NR1+1 
NR3-HR2+1 

TR0-SR (NBflSE) +SR (NR2) 

TR2-SR  (NBflSE) -SR  CNR2) 

TR 1 «SR (NR  I ) +SR  CNR3) 

TR3-SR (NR  1 )  -SR  (NR3) 

TI 1-SI (NR1)+SI  (NR3) 

TI3-SI  (NRl)-SI (NR3) 

SP.  (NBflSE)  -TR0+TR 1 
SR(NR2)-TR0-TRi 
SR(NR1)-TR2+TI3 
SR(NR3) -TR2-TI3 
TI0-S I (NBASE) +S I (NR2) 

TI2-SI  (NBASE)-SI (NR2) 

SI  (NBflSE) -TI0+TI1 
S I (NR2) -TI0-TI 1 
SI(NR1)«TI2-TR3 
SI(NR3)«TI2+TR3 
420  NBASE-N8ASE+4 
430  NBASE -NBASE+NLUP2 
440  NBASE-NEASE+NLUP23 
800  IF(NA.NE.B)  GO  TO  1600 
C 

C 

C  THE  FOLLOWING  CODE  IMPLEMENTS  THE  B  POINT  POST-WEAVE  MODULE 

C  .  . . 

c 

NLUP2-8*(ND2-NB) 

NLUP23-8*ND2*(ND3-NC) 

NBflSE- 1 

DO  840  N4-1.ND 
DO  830  N3-1.NC 
DO  820  N2-I.NB 
NR1-NBASE+1 


C-19 


uuuuuuu 


NR2-NR1+1 
NR3-NR2+1 
NR4-MR3+1 
.  NR5-NR4+1 
NR6-NR5+1 
NR7-NR6+1 

Tl-SR(NBASE)-SR(NRi) 

SR (NBfiSE) -SR  (NBftSE) +SR (NR  1 ) 
SRS-SR (NR2) +S I (NR3) 
SR(NR2)-SR(NR2)-SI (NR3) 
T4*SR(NR4)-SI (NR5) 

T5-SR (NR4) +5 1 (NR5) 
T6-SR(NR7)-SI (HRS) 
T7*SR(NR7)+SI (HRS) 
SR(NR4)-Tl 
SR(NRl) -T4+T6 
SR3-T4-T6 
SR5-TS-T7 
SR (NR?) -T5+T7 
Tl-SI (NBfiSE)-SI  (NR1) 

S I (NBASE) -S I (KBfiSE) +S I (NR  1 ) 
T3-S! (NR2)-SR(NR3) 

S I (NR2) -S I (NR2) +SR (NR3) 

T4-S I (NR4) +SR (NR5) 

T5-SI (NR4)-SR(NR5) 

S I (UR6) -T3 
T6-SR (NR6) +S I (NR7) 
T7-SRCNR6) -SI (NR?) 

SI (NR4) -Tl 
SI(NR1)-T4+T6 
S I (NR3) -T4-T6 
S I (NR5) -T5+T7 
S I (NR7) -T5-T7 
SR(NR3)*SR3 
SR  (NR5) -SR5 
SR (NRG) -SR6 
820  NBflSE-NBflSE+8 
830  NBASE-NEHSE+NLUP2 
840  NBASE-NBASE+NLUP23 
1600  IF (Nft.NE. 16)  RETURN 


THE  FOLLOUING  CODE  IMPLEMENTS  THE  16  POINT  POST-UEAVE  MODULE 


NLUP2-18*(ND2-NB) 

NLUP23«18*ND2*(ND3-NC) 

NBASE-1 

DO  1640  N4-1.ND 
DO  1630  N3-1*NC 
DO  1620  N2-1.NB 
NR1-NBASE+1 


NR2-NR1+1 

NR3-NR2+1 

NR4-NR3+1 

NR5-NR4+1 

NR6-NR5+1 

NR7-NRS+1 

NR8-NR7+1 

NR9-NRB+1 

NR1B-NR9+1 

NR11-NR10+1 

NR12-NR11+1 

NR13-NR12+1 

NR14-NR13+1 

HR  15-NR 14+1 

NR16-NR15+1 

NR17-NR16+1 

TC2)  -SRCNBASE)-SRCNR1) 

SR  CUBASE) -SR (NR  1 ) +SR (NBASE) 

TC4)-SRCNR2)+SICi;R3) 

TC3)-SRCNR2)-SICNR3) 

TC6)  -SRCNF:4)+SI  CNR5) 

T (5) *SR CNR4) -S I (NR5) 
TC8)«-SICNR6)-SRCNR7) 

T(7) *-SI (NR6) +SR (HR7) 
TC9)«SRCNR8)+SRCNR14) 

T( 15) “SR CNR8) -SR  CNR  14) 

T( 13) --SI CNR10)-S I  CNR  12) 
TCll)-SICNRl0)-SICNR12) 

TC 16) -SR  CHRIS) -SR  CNR  17) 

TC 12) -SR  CNR  1 1) -SR  CNR  17) 

TC 10) --SI CNR9) -S I  CNR  16) 

TC 14) --S I  CNR  16) +S I  CNR  13) 
SRCNR2) -T(5)+TC7) 
SR6-TC5)-TC7) 

Sni0-T(6)+TCB) 

SRCNR 14) -TC6)-TC8) 

QC7)-TC9)+TC10) 

OC8)-T(S)-T(10) 

OC 1) -T( 1 1) +T( 12) 
QC2)-TCll)-TC12) 

QC4) -T  C 14) +T( 15) 
QC5)-TC15)-T( 14} 
QC3)-T(13)+TC1G) 

QC6) -T( 13)-T( 16) 

SRCNR1) -QC3)+QC7) 

SRCNR7) -QC7)-0(3) 
SR9-QCB)+0(6) 

SR  CNR  15) -C 18) -0(6) 

SRS-QC 1)+QC4) 

SR3-0C4)-0Cl) 

SR13-0C2)+QC5) 

SRU-0(5)-QC2) 

SR(NR8)-TC2) 

SRCNR4) -TC3) 


SR12«T(4) 

T(2) “SI C NBASE 5  — S I  CNR  1 5 

S I  (NEftSE)  “SI  (NRD+SI (NBfiSE) 

T(4)“SI(Kn2)-SR(NR3) 

T (3) “S I (NR2) +SR (NR3) 

T(6) “SI (NR4) -SR (NR5) 

T(5)  *SI  (NR4)+SR(NR5) 

T  (8) “SR(NR6)-SI (NR?) 

T(?) “SRCNR£)+SI (NR?) 

T (9) *S I (NR8) +S I (HR  14) 

T( 15) “SI (NRB) -S I (NR  14) 
T(13)“SR(NR10)+SR(KR12) 

T(  1  t)“SR(NR12)-SP.  (HR1G) 
T(16)*SI(NR15)-SI(NRt?) 

T( 12) *SI (NR  1 1)-S I (NR1?) 
T(10)“SR(NR9)+SR(NR16) 

SI (NR2)*T(5)+T(?) 

S I (NRG) «T(5) -T(?) 
SI(NR10)“T(G)+T(B) 

SI (NR  14) “T(6)-T(B) 

Q (?) “T (9) +T ( 10) 

Q(8) «T(9)-T( 10) 
G(I)«T(ll)+T(12) 

0(2) “T( 1 l)-T( 12) 

0 (4) “T( 14) +T( 15) 

0(5) “T ( 15)-T( 14) 

0(3) “T( 13)+T( 16) 

0(6) “T( 13)-T ( IS) 

SI(NRl) «0(3)+0(?) 

S I  (NR?) »Q(7)-Q (3) 

S I (NR9) «0(8) +0 (6) 

SKNR15)  “Q(8)-G(G) 

S I (NRS) “Q( 1 ) +Q (4) 

SI (NR3) “0(4) -0( 1 ) 

S I  (NR 13)-Q(2)+Q(5) 
SI(NRU)-Q(5)-0(2) 

S I (NR8) “T (2) 

SI(NR4)-T(3) 

SI (NR12) “T(4) 

SR(NR3)“SR3 
SR (NR5) -SR5 
SR(NR6) “SR6 
SR(NR9) -SR9 
SR (NR 10) -SR  10 
SR (NR  11) «SR  1 1 
SR(NR12) “SR12 
SR(NR13)“SR13 
1620  NBftSE-NEftSE+16 
1630  NBASE “NBASE+NLUP2 
1640  NBASE “NBASE+NLUP23 
RETURN 
END 


APPENDIX  D 


Prime  Factor  Algorithm  Program 
This  appendix  contains  the  driver  and  subroutine  for 
executing  and  timing  the  prime  factor  algorithm  (PFA) 
developed  by  Burrus  (Burrus  and  Eschenbacher ,  1981).  The 

program  consists  of  a  single  subroutine  called  using  the 
format 

CALL  PFA(XR,XI,A,B,N,NFT,NI,UNSC,XIN,XOUT) 
where  the  variables  are  defined  as  follows. 

XR  -XR  is  a  real  array  of  dimension  N.  It  stores  the  real 
component  of  the  input  sequence. 

XI  -XI  is  a  real  array  of  dimension  N.  It  stores  the 

imaginary  component  of  the  input  sequence. 

A  -A  is  a  real  array  of  dimension  N.  It  contains  the 

real  component  output  sequence. 

B  -B  is  a  real  array  of  dimension  N.  It  contains  the 

imaginary  component  of  the  output  sequence. 

N  -N  is  an  integer  variable  specifying  the  length  of  the 
sequence.  N  must  be  a  product  of  relatively  prime 
factors  from  the  set  2,  3,  4,  5,  7,  8,  9,  and  16. 

NFT  -NFT  is  an  integer  specifying  the  number  of  relatively 
prime  factors  in  N. 

NI  -NI  is  an  integer  array  of  dimension  NFT  where  each 

element  is  a  factor  of  N. 

UNSC  -UNSC  is  an  integer  unscrambling  constant  used  to 
permute  the  output  sequence  into  its  proper  order. 
UNSC  is  found  by 
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c 

c  ***  SMART  OF  ECV  I.  ROUTINE  B  ***> 

C 

C 

r  *******■:*::  OWM:***1--'  ^'tttto*#zaWT!>&cXXa&Kila&a&*aiDlc&K^^ 

p 

C  PROGRAM  BUP.RU5  PRIME  FACTOR  ALGORITHM  DRIVER 
C 

C  TH T S  DRIVER  HRS  EF:£ I C ALLY  BEEN  COPIED  FROM  THE  PRIME  FACTOR 
0  ALGORITHM  DRIVER  RECEIVED  FROM  ROUTE  1 

“  I 

0  AUTH":  MARL  MEHALIC 

C  DATE :  2?  AFR  B3 

C  VEFEIOi::  I.E 

C  CALL  INC  MODULES :  NONE  (THIS  IS  THE  WIN  PROGRAM) 

C  MODULES  CALLED:  EPFA.  GRAPH  i 

e  i 

C*#*'::'**'#-* 

C 

r- IKENS ION  >1?  (504) ,  XI  (504) .  A (504) , B (504) 

DIMENSION  Ff.EC (504) 

DIMENSION  NIC 3) 

F.EAl  MAG  (EE4) 

INTEGER  I'NSC 

p 

C  CHANGE  THE  VALUE  OF  T  TO  CHANGE  THE  SEQUENCE  LENGTH  (^100) 

C 

H  -  504 

t  *  i;/iee 

NFT'E 

r 

C  THE  FACTORS  HI  ( 1) »  NK2).  NI  (3)  MUST  BE  CHANGED  FOR  DIFFERENT  LENGTH; 

r 

NI ( 1) *9 
NI ;2) «E 
N I (3) *" 

DELT-.B1 
L'NSC  •  19 1 
C 

C  DEFINE  THE  FREQUENCY  RECORD  LINGTH.DELF: 

C  * 

IELF-1.0/T 

C 

C  GENERATE  THE  P-FRAY  TO  BE  TRANSPORTED: 

F 1*3. 1415S265E5B98 

E-2.7182E 

FQ-25.0 

DO  10  I-l.N 

T-(I-1)*DELT 

ARG«2.0*PI*FQ*T 

TE—  t.0*T 


0-3 


XI  ( D  *0 . 

10  XRC)«lE**7E)*C0SlARG) 

PRINT*. '  7E£r'‘ 

C 

C  ***♦****•>;*  »-y*otu>i:>;  ■  :%-*q^-sori 

c 

C  CALL  T-:E  PRIME  rfiCTC-r:  ALGORITHM  SUBROUTINE  AND  DETERMINE  TIMING 
r 

C 

CALL  FrAOT.YI.A.E.N.NFT.NI.UNSC.XIN.XOUT) 

FEINT*. 'TEST' 

C 

C  GENERATE  THE  REAL.  IMAGINARY,  AND  MAGNITUDE  ARRAYS 
DO  36  1*1. 1. 

36  MAG 1 I'  •(  t  (:( l)**2l+<  BCI)**2))**0.5 
FRUIT*. 'TEST' 

r 

C  GENERATE  the  frequency  array,  as  per  convene  ion. 

C  FRECL-EKCY  TERMS  AGILE  THE  FOLDING  FRECjENCY  F/2 
C  ARE  ASSIGNED  TG  NEGATIVE  FREQUENCIES. 

r 


*  otofOloioiojoMoioK 


L 

C  PRINT  THE  MAGNITUDE  AND  FREQUENCY  ARRAYS 


r 

u 

C  Me  V:NoMo**aWalaMoMofc>MoMof:)Mof3M^ 

c 

DO  50  1*1, N 

<0  FF.EC 1 1) •  i (I  - 1 5 »T ELR ) -50 . 0 

PRINT* .FRED ( I) .  MAGf I) 

53  CONTINUE 

PRINT*. 'TEST' 


C 

C  CALL  GF.hr  H  TO  PLOT  THE  MAGNITUDE  ARRAY 

C  _________ 

C 

CALL  GRAPH(MAG.N) 

C 

C  END  OF  MAIN  PROGRAM 


pr  T  T*-.',£iT  O.’EF* 

SVC 

ENI 

C 

c  ***  END  OF  BOX  I.  ROUTINE  B  Mery****** 
C 


0-4 


u  U  U  «  J  U  U  U  J  U  t.)  uuuuuuu 


_  . •****¥»  *^MGkx#r#*c**cw 

u 

L  SUERDL^It;:  EJRRUS  PRIME  FACTOR  ALGOR  I  “VIM 
C 

C  PACT:  ?7  fipr  22 

:  VERSION:  l.P 

c  this  sue  "Gut  i  !■«=  uss  taken  from  as  article  written  by  c.  Sidney 

-  BLT.r.JE .  IT  HAS  BEEN  MODIFIED  TO  RUN  Ch  A  PDP  11/50  USING 

C  THE  F4P  FE'-.TRftli  COMPILER. 

r 

Z  #^*****>:  :>«  :>»-X '  So*c>K^"M^f^^>K^>K>t'.>K^o#o*o*o|o4o|T>|o!:>K^H?^K?|rHo*o#C3*c3#oioK>*oK3fo#c>ir>K^5^ 

c 

SUBROUTINE  PFACX.  Y,  A.B.N.Mi.NI  .UNSC.X1N.X0UT) 

ft  PRIME  FACTOR  FFT  PROGRAM 

X,Y  -COMPLEX  INPUT  DATA  ARRAYS.  REAL  DATA  IN  X  AND 
IMAGINARY  VALUES  IN  Y. 

A.B  -COMPLEX  OUTPUT  VECTORS.  REAL  VALUES  IN  A  AND 
IMAGINARY  VALUES  IN  B. 

M  -THE  NUMBER  OF  FACTORS  OF  N 
N  -SEQUENCE  LENGTH  UHICH  MUST  BE  FACTORABLE 

BY  MUTUALLY  PRIME  NOS  FROM  THE  SET  C2.3.4.5.7.8.9. 1G) 

N!  -ARRAY  LENGTH  M  CONTAINING  THE  FACTORS  OF  N. 

UNSC  -UNSCRAMBLING  CONSTANT  EOUAL  TD  N/H!  C 1.'  +N/NI (2)  + 

_ +N/NKM) . 

PROGRAM  BY  C.S.  6UPRUS 
RICE  UNIVERSITY.  AcG  1930 

riMENSIGM  XCM; ,  YUM  .  fi(N)  ,BCN) 

INTEGcF  NIC4).  ICE),  UNSC 

IATA  C3I,  C32  /  0.6663254.  0.500COGC  / 

DATA  CSI,  C52  /  e.95:B5£52,  1.S3E941S  / 

IATA  C53.C54  /0. 36227126.  B.5590169S/ 

DATA  C55  /-I.25  / 

DATA  C71.  C72  /  -1.1SSS6SS7.  8.79015647  / 

DATA  C73.  C74  /  0.B555542S7.  0.7343022  / 

DATA  C75.  C76  /B. 44095855.  0.340B729Z  / 

DATA  C77.  C7B  /  0.5339^936.  0.B7484229  / 

DATA  CBi  /E. 707 10670/ 

DATA  CSS.  C93  /  0.53565262.  -0.17364218  / 

DATA  C54,  C9S  /0. 76604444,  -0.34202014  / 

DATA  CS7.  CEB  /  -0.SS4SC'75,  -0.6427E751  / 

DATA  C ICC.  C16Z  /  0.3B26S343.  1.30656257  / 

DATA  C 164,  C 165/  0.54119610,  0.92387953  / 

XIN-SECISDSC0.0) 

DO  IE  K»  1 .  M 

Nl-NHK) 

N2-N/H1 

DO  20  J-l.N.NI 


-  ■  - 


! t 1 ) -J 
IT- J 

DC  SC  L-2.N1 
IT-IT+;,2 

IF ( IT. GT.K)  IT«! T-N 
I  CL) -IT 

30  CDNTIli'JE 

GO  TO  f  2C. 103. 103. 104. 105.20. 107. 10S. 109. 
C  20.20,20. 26,20.20. 1 16). N1 

C 

C  UFTfi  N-2 
C 

102  Tl-XCICI)) 

X-  I I ) )  -Tl+XC !  (2) ) 

XC:;2>)-71-XCI(2)) 

Tl-YCIC!)) 

YC !(!))-' i+Yi 1(2)) 

YC I C2) ) -Tl-YC I (2; ) 

GC  t  G  2C 
C 

C  LFTft  H-2 

r 

103  Tl»(X(!(2))-X'ICZ)')*0Zl 
Ul-CYC i (2) )-Y( I (2) ) )*231 
R1-XCK2) j+Xf 1(3)) 

E l-YC 1(2)) +Y( 1(3)) 

T2-XC I ( 1) )-Ri*€32 
U2«YCI(1))-S!*:32 
x<  i  ( i)  >  *x(  i  ( i )  )+r  i 

YCIU))«Y(m))+Sl 
y<:  i  (2)  ?  -T2-r'j  i 

X  I  (?)  )  -T2-L'  1 
Y‘  !  (2) ) 

Y(  I  (2)  )«L2-*-Ti 
GO  TO  20 

C 

C  UFTF.  N-4 
C 

104  Rl-X(I(l))+X(I(Z>) 

F.2-xn(I))-X(I(3)) 

51- Yf l ( 1) )+Y( I (3) ) 

52- Y< I ( 1) )-Y( I (3) )  * 

R3-Xt I (2) ) +X( 1(4)) 

F!4»X<  I  (2)  )-X(  I  (4) ) 

53- Y: 1 12) )+Y( I (4) ) 

54- YC I (2) )-Yl 1(4)) 

X(  I  ( 1 ) )  -Pi-F.3 

X(  I  (3) ) -R 1-R3 
Y(IC1))-S:+S3 
YC  I (3) ) -S1-S3 
XCIC2) )»R2+S4 
X(IU))-R2-S4 
Y(I(2))*S2-R4 
Y(I(4))-S2+R4 
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GO  TO  20 


C 


UFTft  N* 

r 

Rl«> 

..  !(2>)+Xf 

:0) 

R2*> 

1  <2) )-X( 

:  1 5  ) 

S'.*'. 

,  i 1 2 ) t  +y . 

1 i5) ) 

S2=Y 

'(  i  f2'  )-Yi 

i  <  e  ) ) 

RS«>; 

:  (4) ) 

R4=X 

:C(3))-x;: 

K4n 

c.j-y 

>  I  (2)  >  +Y 

\  14)) 

e<,v 

!  (3);-Yl 

:i4)) 

I!*;. 

R2+F.4.i  *25 

i 

ui-r 

C?xCk-,  y;-r. 

k 

F:2«7!-R2sCE2 
S2*L'  i-S2*052 
R4«-7!-R4*C5I 

54- L:-S<f'052 
Tl«lR!-R3)*:54 
U ! *  f S ! -S3) *254 
T«:  «r.  1  +p.Z 

U2«S  l-!-S3 

X:  iU))»X(I(!>>+72 
va(i))-vc(i5:+u2 
72  eXC I ( 1 ) ) +T2*C55 
U2*Y(  I  ( 1) )+U2*C55 
F!«TZvT1 
RZ*“-T1 
Si*J2+Ul 
S3-'J2-l'l 
X:'id)>«Rl+S4 
Xr:r2')«R!-£4 
YHf2))-Sl-R4 
Vi  i  i.S) )  »£ 1+R4 
Xv :  C3) )  -F.Z-S2 
Xl  I ;  4} )  «F2+S2 
Y(!d>)«£3*»-R2 
Yd(4))-S3-R2 
GC  TO  20 
C 

c 

C  UF7P  N«7 
107  R»;<iC25)+X(I(7)) 

R2«X( I (2) )-X( I (7) ) 
S1-YC!C2))+Y(IC7)) 
S2»YCI(2))-Y(Il7)) 
F7-M  I  (35  )+X(  I  (G) ) 
R4-XvI(3))-xac6)) 
E3-Vf!(3))+YU(G)5 
S4«YfIi3))-Y(I(6)) 
R5-XCI(4))+X(I(55) 
R6*X( I (4) )-XC I (5) ) 

55- Y- IC4))+Y(l(5)) 
Sfi-Yd(4))-Y(l(5)) 
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nnn 


T1-R1+R3+R? 

l+Si+SE 

I  1)  5  T  C  1 1  )+Tl 

Y(  it  p  )  “V(.  i  ( ;)  )+ui 
Tl*Xi  III)) +Ti»€? 1 
ui-vi:(!))h::*[?i 
T2-cr?>!(Ri-r.E.1 
U2-C?2^tS!-FD 
T3-C?3*cR5-F2) 
U3*C-Zv;T5-L?  ' 

!  4C  L  ?  4  P  C  —  t  1  ) 
U4-C74* tSI-Sl) 
R1-T1+T2+T3 
R3*Tl-72-T4 
R5X7I-TZ+T4 
S1*=L>  1+U2tu3 
S3XU’-U2-U4 
£3-U:-L!2+U4 
LI  i  *C75*  ( S2+-S4-SS ) 
T>C?5*(R2+R4-R£) 
72“Crc*tr.2TF:S) 
L2-C?6*(S2+Sf) 
T3'C77*tF:4+R£) 
U3=C?7fc(S4+So) 
T4«C76*CR4-R2) 
iJ4«C78*<S4-S2> 
R2-T1+T2+T3 
R4»T1-T2-T4 
R6x7l-T3+T4 
S2*U1+U2+U3 
S4«U  1-U2-U4 
S6‘LH-U2-KJ4 
X( I (2) ) *R 1+S2 
X<:i(?))=Rl-S2 
Yt I C2) ) "E1-R2 
Yt I(7))-S2+R2 
X< I (3) ?  «R3+S4 
X( I  to) ) »R3-S4 
Y(ItZ))«S3-R4 
Y( I (6) ) «S3+R4 
Xt I f  4) ) «R5-S6 
X< IC5))«F5+S6 
Yt I (4) ) «S5+R6 
Yt I (5) ) »E5-R6 
EC'  TO  20 

UFTF;  N-e 

108  Ri-xtrtm+xatS)) 
R2»xcnn)-x(i(5)) 
S1*Y( I  Cl) )*Y(  I (5) ) 
S2«Y(l(l))-Y(I(5)) 
R3«Xa<2))+X(I(8)) 
R4-X(K2))-X(I(B)) 


S3-Y(l(2))+Y(I(8)) 
S4»Y  I  (2))-Yi  HE)) 
P3«Ki  f  ( 3 ) )  +X  •'  I  (?) ) 
k6*>:;  i  (3)  )-:.*!(?)) 
S5“Vi I (3) )+V i ! (?) ) 
Sc*Y(  i  (3) )  -Yd  (?) ) 
P7*Y(  I  (4) )  +>\  I  (.6) ) 
R6*XlI(4>)-Xi 1(6)) 
Sr«Y(I(4))+\ ll(6)) 
S£*YC 1 14) )-Y( I (6) ) 
7 !  *F.  i  +R5 
T2-R1-P5 
U 1 *81+55 
U2-S1-S5 
T3*P3+R? 
R3*(F':-P7)*C81 
U3*5?-S? 
S3*'S~-SD*Cei 
T4*Ri.-R£ 
P4»(R4+R3)*C81 
114*34-88 
S4*(S4+8E)*CB1 
T5*R2+R3 
T6-R2-P? 

U5-S2+83 

UE-S2-S3 

T7-R4+RG 

T8-R4-R6 

U7-S4+S6 

UB*S4-S6 

X(  I  ( 1) ) «Tl+T3 

XC!(S))-Tl-T3 

Y( I ( 1) ) *U 1+U3 

Y(  I  (5) ) ■U1-U3 

X(H.2)>*T5+U? 

y.C  I  (E) )  *T5- J? 

Y(  I  (2) ) «U5-77 
Y( I  (6) )-U5+T7 
X!  I(3»*T2+U4 
X(I(?))-T2-U4 
Y(If3))*U2-T4 
Y( I  (?) ) *U2+T4 
X( ! (4) ) *T6+UB 
X<:i(6))«T6-U8 
Y( I (4) ) *U6-TE 
Y(  H£; ) *U£+TE 
GC  TO  20 

C 

C  UFTfl  N«9 

C 

109  R1*X( I C2) )+X( I (9) ) 

R2*X( I (2) )-X( 1(9)) 
SI*Y( I (2) )+Y( I (9) ) 
S2*Y(I(2))-Y<I(9)> 


* 
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,  R3«X<I(;r)+xri(B)) 
•vR4*Y:-  1  (S'  '■  -X‘  !  (0) ) 

*53- S  •  i(2>;+YC;  (6j) 

54- VCI(35)-Yi'I(e)) 
R5-XCK4  ))+X<!(?)) 
7--(XlI(4))-X(I(?)5)  #C3 1 

55- YC I (4) 5 +Vt I l?) ) 

U—  (Y(!f4^:-YC!f?)))*C31 
R?-XC(5))+/;i!fej) 

RE-Xi  I  i'S )  )  -Xt  I  (G) ) 

S?-Yil  1  5  1 )  +  Y (.  1 1 1 ) ) 
Se»Y(US))-YCIfg5) 
Rs-xcoj+rs 

S5-YC  Id?)  +ST 
T! mK( 1(2)) -pjrCZ 2 
Ul»Yciti>)-S5»C32 
72*(R3-R7)*~S2 

y3«(§j-g7)S'-f  j 

T3- (f;  1-E?)  xCSI 
L'3-(Ei-S7)*C?Z 
T4*  (R  2  -R2)  !t  Cf  4 
U4-(S1-E3)*CS4 
R  16*P 1+R3+R7 
S  lE-S 1+E3+S? 

R 1*7! +72+74 
R3-71-72-73 
R7-T1+T3-T4 
Si-Ul+L'2+U4 
53-U2-U2-U3 

57- U 1+U3-U4 
X(I<  1 ) )  «R5i-R  18 
Y(  I  ( 1 ) ) -ES+310 
R5*F.F-n  1G*C32 
£5*S?-Sl0>tC22 
R£— .'R2-R4+r6)C3l 
S6=-(SZ-S4+SE)*C31 
T2,(F4+RS)*'£S£ 
U2*(f4+se;*:?g 
T3*  (R2-RE)  *C9? 
U3-IE2-SBX5? 

T4-  (R2+R43  *'C?S 

U4-  (S2+S4)  *C?B  , 

R2-7+T2+T4 

R4-T-72-T3 

F.3-7+73-T4 

S2«L,+U2+U4 

S4«w-U2-U“ 

58- U+U3-U4 
XCI(2))«Rl-S2 
X(  I (9) ) -R1+S2 
Y(I(2))-S1+R2 
Y(KF))-S1-F2 
X(I(T))-R3+o4 
X(I(6))»P.3-S4 
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Y(  I (3) ) «E3-R4 
YCIfP;  )*S3-*-R4 
X(I  <.45 )  *R5-Sb 
.\(I(?;)«R5+S£ 

Y( I (41 ) -S5+R6 
YCI(?))«S5-R6 
X(I(5))«R?-Se 
x<  n£j) -Rr+££ 

Yf  1 1?) )  -S7+FSL 
Y(I(6))«S7-R8 
GO  TO  20 

C 

C  UFTA  H- 16 

C 

116  R1«X(!(1))+X(!(S)) 

R2*X( I ( 1) )-X( I (9) ) 
Sl-Y(iun+Y(!(9?) 
S2*Y( 1(1); -Yi  1 (S; ) 
R3=X(  I  (2) )  -t-XC  1(10)) 
R4-X(l(2)l-X(iue>) 
S3»Vi!C2))+YCI(t0)) 
S4«Y( I (2)  )-Y( I ( 16) ) 
R5«X( I (2) )+X( !(!!)) 
R6-X(I(3))-Xi  1(1!)) 
S5«YC I (3) ) +Y( I ( i 1 ) ) 
S6“Y(  I  (3)  )-Y(  KID) 
R7*XC! (4))+X(I(t2>) 
R8*X( I (4) )-X( I ( 12) ) 
S7«Y(I(4))+Y(I(12)) 
S8“Y( ! (4) )-Y( I ( 12) ) 
RS“X(I(5))+X(I(13)) 
R10«X(!(5))-X(1(13)) 
S9«Y( ! (5>)+Y( 1(13)) 
Sl6*Y(!(5))~Y(!(13)) 
R! 1*XC I (6) )+X( 1(14)) 
R12«X(I(6))-X< 1(14)) 
S11*Y(I(6))+Y(i(l4)) 
S 12-YC I (6) ) -Y( I ( 14) ) 
R13-X(I(7))+X(I(15)) 
R14*X (1(7)) -X (1(15)) 
S 13*Y( I (7) ) +Y( I ( 15) ) 
S14*Y( I(7))-Y(I(15)) 
R:s-X(I(B5)+X(l(16)) 
R16“X( I (B) ) -X( I ( 16) ) 
S15-Y(I(0))+Y(I(16)) 
£16-Y(I(B))-Y(I(16)) 
Tl-Ri+RS 
T2-R1-R9 
U1-S1+S9 
U2-S1-S9 
T3-R3+R11 
T4-R3-R11 
U3-S3+SU 
U4-S3-S11 


( 


Ta-R5+R!3 

T6-F:b-Rl3 

U5-S5+S1Z 

U6-S5-S13 
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